Patent abstract:
"Systems and Methods for Performing Bayesian Optimization". Techniques for use in connection with performing optimization using a plurality of objective functions associated with a respective plurality of tasks. The techniques include using at least one computer hardware processor to perform: identifying, based at least in part on a joint probabilistic model of the plurality of objective functions, a first point at which to evaluate an objective function in the plurality of objective functions; selecting, based at least in part on the joint probabilistic model, a first objective function in the plurality of objective functions to evaluate at the identified first point; evaluating the first objective function at the identified first point; and updating the joint probabilistic model based on the results of the evaluation to obtain an updated joint probabilistic model.
Publication number: BR112015029806A2
Application number: R112015029806
Filing date: 2014-05-30
Publication date: 2020-04-28
Inventors: Ryan P. Adams; Jasper Roland Snoek; Hugo Larochelle; Kevin Swersky; Richard Zemel
Applicants: President And Fellows Of Harvard College; The Governing Council Of The University Of Toronto; Scopra Sciences et Génie s.e.c.
IPC main classification:
Patent description:

Descriptive Report of the Patent of Invention for “SYSTEMS AND METHODS FOR PERFORMING BAYESIAN OPTIMIZATION”.
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 61/829,090, entitled “TECHNIQUES FOR PERFORMING BAYESIAN OPTIMIZATION”, filed on May 30, 2013 under attorney docket No. H0776.70085US00; U.S. Provisional Patent Application No. 61/829,604, entitled “TECHNIQUES FOR PERFORMING BAYESIAN OPTIMIZATION”, filed on May 31, 2013 under attorney docket No. H0776.70086US00; and U.S. Provisional Patent Application No. 61/910,837, entitled “TECHNIQUES FOR PERFORMING BAYESIAN OPTIMIZATION”, filed on December 2, 2013 under attorney docket No. H0776.70089US00; each of which is incorporated herein by reference in its entirety.
FEDERALLY SPONSORED RESEARCH
[0002] This invention was made with government support under YFA N66001-12-1-4219 awarded by the Defense Advanced Research Projects Agency (DARPA). The government has certain rights in the invention.
BACKGROUND
[0003] A machine learning system can be configured to use one or more machine learning techniques (for example, classification techniques, clustering techniques, regression techniques, structured prediction techniques, etc.) and/or models (for example, statistical models, neural networks, support vector machines, decision trees, graphical models, etc.) to process data. Machine learning systems are used to process data arising in a wide variety of applications across different domains including, but not limited to, text analysis, machine translation, speech processing, audio processing, image processing, visual object recognition, and analysis of biological data.
SUMMARY
[0004] Some embodiments are directed to a method for use in connection with performing optimization using an objective function. The method comprises using at least one computer hardware processor to perform: identifying, using an integrated acquisition utility function and a probabilistic model of the objective function, at least a first point at which to evaluate the objective function; evaluating the objective function at the at least first identified point; and updating the probabilistic model of the objective function using results of the evaluation to obtain an updated probabilistic model of the objective function.
[0005] Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for use in connection with performing optimization using an objective function. The method comprises: identifying, using an integrated acquisition utility function and a probabilistic model of the objective function, at least a first point at which to evaluate the objective function; evaluating the objective function at the at least first identified point; and updating the probabilistic model of the objective function using results of the evaluation to obtain an updated probabilistic model of the objective function.
[0006] Some embodiments are directed to a system for use in connection with performing optimization using an objective function. The system comprises at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: identifying, using an integrated acquisition utility function and a probabilistic model of the objective function, at least a first point at which to evaluate the objective function; evaluating the objective function at the at least first identified point; and updating the probabilistic model of the objective function using results of the evaluation to obtain an updated probabilistic model of the objective function.
[0007] In some embodiments, including any of the preceding embodiments, the objective function relates values of hyperparameters of a machine learning system to respective values providing a measure of performance of the machine learning system. In some embodiments, the objective function relates values of a plurality of hyperparameters of a neural network for identifying objects in images to respective values providing a measure of performance of the neural network in identifying objects in the images.
[0008] In some embodiments, including any of the preceding embodiments, the processor-executable instructions further cause the at least one computer hardware processor to perform: identifying, using the integrated acquisition utility function and the updated probabilistic model of the objective function, at least a second point at which to evaluate the objective function; and evaluating the objective function at the at least second identified point.
[0009] In some embodiments, including any of the preceding embodiments, the probabilistic model has at least one parameter, and the integrated acquisition utility function is obtained at least in part by integrating an initial acquisition utility function with respect to the at least one parameter of the probabilistic model.
[0010] In some embodiments, including any of the preceding embodiments, the initial acquisition utility function is an acquisition utility function selected from the group consisting of: a probability of improvement utility function, an expected improvement utility function, a regret minimization utility function, and an entropy-based utility function.
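By way of a hedged illustration (not the claimed implementation), the integrated acquisition utility of the two preceding paragraphs may be approximated by averaging an expected improvement utility over sampled values of the probabilistic model's parameters. The one-dimensional Gaussian-process surrogate, squared-exponential kernel, and fixed length-scale samples below are illustrative assumptions, standing in for draws produced by a sampling technique:

```python
import numpy as np
from scipy.stats import norm

def gp_posterior(X, y, Xs, ls, noise=1e-6):
    """GP posterior mean/std with a squared-exponential kernel of length-scale ls."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)
    Kinv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
    mu = k(Xs, X) @ Kinv @ y
    # diagonal of the posterior covariance (prior variance is 1 for this kernel)
    var = 1.0 - np.einsum('ij,jk,ik->i', k(Xs, X), Kinv, k(Xs, X))
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """Closed-form EI under a Gaussian predictive, minimization convention."""
    z = (best - mu) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

def integrated_ei(X, y, Xs, ls_samples):
    """Average EI over samples of the kernel length-scale: the 'integrated'
    acquisition marginalizes the acquisition over the model's parameters."""
    acq = np.zeros(len(Xs))
    for ls in ls_samples:
        mu, sigma = gp_posterior(X, y, Xs, ls)
        acq += expected_improvement(mu, sigma, y.min())
    return acq / len(ls_samples)

# toy use: pick the next point at which to evaluate the objective function
X = np.array([0.1, 0.5, 0.9])
y = np.array([0.8, 0.3, 0.6])
Xs = np.linspace(0.0, 1.0, 101)
acq = integrated_ei(X, y, Xs, ls_samples=[0.1, 0.2, 0.4])
x_next = Xs[np.argmax(acq)]
```

The averaging makes the choice of next point robust to uncertainty about the model's parameters rather than committing to a single point estimate.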
[0011] In some embodiments, including any of the preceding embodiments, the probabilistic model of the objective function comprises a Gaussian process or a neural network.
[0012] In some embodiments, including any of the preceding embodiments, the identifying is performed at least in part by using a Markov chain Monte Carlo technique.
[0013] In some embodiments, including any of the preceding embodiments, the processor-executable instructions further cause the at least one computer hardware processor to perform: identifying a plurality of points at which to evaluate the objective function; evaluating the objective function at each of the plurality of points; and identifying or approximating, based on results of the evaluations, a point at which the objective function attains a maximum value.
[0014] Some embodiments are directed to a method for use in connection with performing optimization using an objective function. The method comprises using at least one computer hardware processor to perform: evaluating the objective function at a first point; before the evaluation of the objective function at the first point is completed: identifying, based on probabilities of potential outcomes of the evaluation of the objective function at the first point, a second point, different from the first point, at which to evaluate the objective function; and evaluating the objective function at the second point.
[0015] Some embodiments are directed to a method for use in connection with performing optimization using an objective function. The method comprises using at least one computer hardware processor to perform: beginning evaluation of the objective function at a first point; before the evaluation of the objective function at the first point is completed: identifying, based on probabilities of potential outcomes of the evaluation of the objective function at the first point, a second point, different from the first point, at which to evaluate the objective function; and beginning evaluation of the objective function at the second point.
[0016] Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for use in connection with performing optimization using an objective function. The method comprises: beginning evaluation of the objective function at a first point; before the evaluation of the objective function at the first point is completed: identifying, based on probabilities of potential outcomes of the evaluation of the objective function at the first point, a second point, different from the first point, at which to evaluate the objective function; and beginning evaluation of the objective function at the second point.
[0017] Some embodiments are directed to a system for use in connection with performing optimization using an objective function. The system comprises at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: beginning evaluation of the objective function at a first point; before the evaluation of the objective function at the first point is completed: identifying, based on probabilities of potential outcomes of the evaluation of the objective function at the first point, a second point, different from the first point, at which to evaluate the objective function; and beginning evaluation of the objective function at the second point.
[0018] In some embodiments, including any of the preceding embodiments, the objective function relates values of hyperparameters of a machine learning system to respective values providing a measure of performance of the machine learning system.
[0019] In some embodiments, including any of the preceding embodiments, the objective function relates values of a plurality of hyperparameters of a neural network for identifying objects in images to respective values providing a measure of performance of the neural network in identifying objects in the images.
[0020] In some embodiments, including any of the preceding embodiments, the at least one computer hardware processor comprises a first computer hardware processor and a second computer hardware processor different from the first computer hardware processor, and the processor-executable instructions cause: at least the first computer hardware processor to perform the evaluation of the objective function at the first point; and at least the second computer hardware processor to perform the evaluation of the objective function at the second point.
[0021] In some embodiments, including any of the preceding embodiments, the identifying comprises using an acquisition utility function obtained, at least in part, by computing an expected value of an initial acquisition utility function with respect to potential values of the objective function at the first point.
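One hedged way to realize this expected value is to "fantasize" outcomes of the still-pending evaluation from the model's predictive distribution and average the acquisition values computed under each fantasy. The Gaussian-process surrogate, its fixed length-scale, and the number of fantasy draws below are illustrative assumptions, not the claimed implementation:

```python
import numpy as np
from scipy.stats import norm

def gp(X, y, Xs, ls=0.2, noise=1e-6):
    """GP posterior mean/std with a squared-exponential kernel."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)
    Kinv = np.linalg.inv(k(X, X) + noise * np.eye(len(X)))
    mu = k(Xs, X) @ Kinv @ y
    var = 1.0 - np.einsum('ij,jk,ik->i', k(Xs, X), Kinv, k(Xs, X))
    return mu, np.sqrt(np.maximum(var, 1e-12))

def ei(mu, s, best):
    z = (best - mu) / s
    return s * (z * norm.cdf(z) + norm.pdf(z))

def fantasized_ei(X, y, x_pending, Xs, n_fantasy=32, seed=0):
    """Expected EI over potential outcomes of a still-running evaluation:
    draw fantasy results at the pending point, condition on each, average."""
    rng = np.random.default_rng(seed)
    mu_p, s_p = gp(X, y, np.array([x_pending]))
    acq = np.zeros(len(Xs))
    for _ in range(n_fantasy):
        y_f = rng.normal(mu_p[0], s_p[0])          # fantasy result at pending point
        Xf, yf = np.append(X, x_pending), np.append(y, y_f)
        mu, s = gp(Xf, yf, Xs)
        acq += ei(mu, s, yf.min())
    return acq / n_fantasy

X = np.array([0.1, 0.9]); y = np.array([0.7, 0.4])
Xs = np.linspace(0.0, 1.0, 101)
acq = fantasized_ei(X, y, x_pending=0.5, Xs=Xs)
x_second = Xs[np.argmax(acq)]                      # second point, chosen in parallel
```

This lets a second processor begin evaluating at `x_second` before the evaluation at the pending point completes.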
[0022] In some embodiments, including any of the preceding embodiments, the probabilities are obtained using a probabilistic model of the objective function, and the processor-executable instructions further cause the at least one computer hardware processor to perform: updating the probabilistic model of the objective function using results of the evaluation of the objective function at the first point and/or the second point to obtain an updated probabilistic model of the objective function.
[0023] In some embodiments, including any of the preceding embodiments, the processor-executable instructions further cause the at least one computer hardware processor to perform: identifying, using the updated probabilistic model of the objective function, at least a third point at which to evaluate the objective function; and beginning evaluation of the objective function at the at least third identified point.
[0024] In some embodiments, including any of the preceding embodiments, the probabilistic model of the objective function comprises a Gaussian process or a neural network.
[0025] Some embodiments are directed to a method for use in connection with performing optimization using an objective function that maps elements in a first domain to values in a range. The method comprises using at least one computer hardware processor to perform: identifying a first point at which to evaluate the objective function, at least in part by using an acquisition utility function and a probabilistic model of the objective function, wherein the probabilistic model depends on a non-linear one-to-one mapping of elements in the first domain to elements in a second domain; evaluating the objective function at the first identified point to obtain a corresponding first value of the objective function; and updating the probabilistic model of the objective function using the first value to obtain an updated probabilistic model of the objective function.
[0026] Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for use in connection with performing optimization using an objective function that maps elements in a first domain to values in a range. The method comprises: identifying a first point at which to evaluate the objective function, at least in part by using an acquisition utility function and a probabilistic model of the objective function, wherein the probabilistic model depends on a non-linear one-to-one mapping of elements in the first domain to elements in a second domain; and evaluating the objective function at the first identified point to obtain a corresponding first value of the objective function.
[0027] Some embodiments are directed to a system for use in connection with performing optimization using an objective function that maps elements in a first domain to values in a range. The system comprises at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: identifying a first point at which to evaluate the objective function, at least in part by using an acquisition utility function and a probabilistic model of the objective function, wherein the probabilistic model depends on a non-linear one-to-one mapping of elements in the first domain to elements in a second domain; evaluating the objective function at the first identified point to obtain a corresponding first value of the objective function; and updating the probabilistic model of the objective function using the first value to obtain an updated probabilistic model of the objective function.
[0028] In some embodiments, including any of the preceding embodiments, the objective function relates values of hyperparameters of a machine learning system to respective values providing a measure of performance of the machine learning system.
[0029] In some embodiments, including any of the preceding embodiments, the objective function relates values of a plurality of hyperparameters of a neural network for identifying objects in images to respective values providing a measure of performance of the neural network in identifying objects in the images.
[0030] In some embodiments, including any of the preceding embodiments, the processor-executable instructions further cause the at least one computer hardware processor to perform: identifying a second point at which to evaluate the objective function; evaluating the objective function at the second identified point to obtain a corresponding second value of the objective function; and updating the updated probabilistic model of the objective function using the second value to obtain a second updated probabilistic model of the objective function.
[0031] In some embodiments, including any of the preceding embodiments, the non-linear one-to-one mapping is bijective.
[0032] In some embodiments, including any of the preceding embodiments, the non-linear one-to-one mapping comprises a cumulative distribution function of a Beta distribution.
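As an illustrative sketch of such a mapping, a Beta cumulative distribution function warps inputs in [0, 1] monotonically (and hence bijectively), which can concentrate the model's resolution in regions where a non-stationary objective varies rapidly. The particular shape parameters below are assumptions chosen only for illustration:

```python
import numpy as np
from scipy.stats import beta

def warp(x, a, b):
    """Bijective non-linear warping of [0, 1] inputs via a Beta(a, b) CDF.
    The warped inputs, not the raw ones, are fed to the probabilistic model."""
    return beta.cdf(x, a, b)

x = np.linspace(0.0, 1.0, 5)
z = warp(x, a=2.0, b=5.0)   # this choice stretches the region near 0
```

Because the CDF is strictly increasing on the support, the original inputs can be recovered with the corresponding quantile function (`beta.ppf`), so no information is lost by modeling in the warped domain.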
[0033] In some embodiments, including any of the preceding embodiments, the acquisition utility function is an integrated acquisition utility function.
[0034] In some embodiments, including any of the preceding embodiments, the probabilistic model of the objective function is obtained at least in part by using a Gaussian process or a neural network.
[0035] Some embodiments are directed to a method for use in connection with performing optimization using a plurality of objective functions associated with a respective plurality of tasks. The method comprises using at least one computer hardware processor to perform: identifying, based at least in part on a joint probabilistic model of the plurality of objective functions, a first point at which to evaluate an objective function in the plurality of objective functions; selecting, based at least in part on the joint probabilistic model, a first objective function in the plurality of objective functions to evaluate at the first identified point; evaluating the first objective function at the first identified point; and updating the joint probabilistic model based on results of the evaluation to obtain an updated joint probabilistic model.
[0036] Some embodiments are directed to at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for use in connection with performing optimization using a plurality of objective functions associated with a respective plurality of tasks. The method comprises: identifying, based at least in part on a joint probabilistic model of the plurality of objective functions, a first point at which to evaluate an objective function in the plurality of objective functions; selecting, based at least in part on the joint probabilistic model, a first objective function in the plurality of objective functions to evaluate at the first identified point; evaluating the first objective function at the first identified point; and updating the joint probabilistic model based on results of the evaluation to obtain an updated joint probabilistic model.
[0037] Some embodiments are directed to a system for use in connection with performing optimization using a plurality of objective functions associated with a respective plurality of tasks. The system comprises at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: identifying, based at least in part on a joint probabilistic model of the plurality of objective functions, a first point at which to evaluate an objective function in the plurality of objective functions; selecting, based at least in part on the joint probabilistic model, a first objective function in the plurality of objective functions to evaluate at the first identified point; evaluating the first objective function at the first identified point; and updating the joint probabilistic model based on results of the evaluation to obtain an updated joint probabilistic model.
[0038] In some embodiments, including any of the preceding embodiments, the first objective function relates values of hyperparameters of a machine learning system to respective values providing a measure of performance of the machine learning system.
[0039] In some embodiments, including any of the preceding embodiments, the first objective function relates values of a plurality of hyperparameters of a neural network for identifying objects in images to respective values providing a measure of performance of the neural network in identifying objects in the images.
[0040] In some embodiments, including any of the preceding embodiments, the processor-executable instructions further cause the at least one computer hardware processor to perform: identifying, based at least in part on the updated joint probabilistic model of the plurality of objective functions, a second point at which to evaluate an objective function in the plurality of objective functions; selecting, based at least in part on the joint probabilistic model, a second objective function in the plurality of objective functions to evaluate at the second identified point; and evaluating the second objective function at the second identified point.
[0041] In some embodiments, including any of the preceding embodiments, the first objective function is different from the second objective function.
[0042] In some embodiments, including any of the preceding embodiments, the joint probabilistic model of the plurality of objective functions models correlation among tasks in the plurality of tasks.
[0043] In some embodiments, including any of the preceding embodiments, the joint probabilistic model of the plurality of objective functions comprises a vector-valued Gaussian process.
[0044] In some embodiments, including any of the preceding embodiments, the joint probabilistic model comprises a covariance kernel obtained based, at least in part, on a first covariance kernel modeling correlation among the tasks in the plurality of tasks and a second covariance kernel modeling correlation among the points at which objective functions in the plurality of objective functions may be evaluated.
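A hedged sketch of one such composite kernel: a task covariance (here an illustrative low-rank choice) combined with an input covariance via a Kronecker product gives the joint covariance over all (task, point) pairs. The kernels, rank, and length-scale below are assumptions for illustration, not the claimed construction:

```python
import numpy as np

def input_kernel(X, ls=0.3):
    """Second kernel: correlation among evaluation points (squared-exponential)."""
    return np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ls ** 2)

def task_kernel(L):
    """First kernel: correlation among tasks, parameterized low-rank as L @ L.T
    so that it is positive semi-definite by construction."""
    return L @ L.T

# 2 tasks sharing a grid of candidate points; the joint covariance over all
# (task, point) pairs is the Kronecker product of the two kernels
X = np.linspace(0.0, 1.0, 4)
L = np.array([[1.0], [0.8]])          # task 2 strongly correlated with task 1
K_joint = np.kron(task_kernel(L), input_kernel(X))
```

Because the Kronecker product of positive semi-definite matrices is positive semi-definite, `K_joint` is a valid covariance for a vector-valued Gaussian process, and evaluations of one (e.g. cheap) task inform the model's beliefs about the others.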
[0045] In some embodiments, including any of the preceding embodiments, the identifying is performed further based on a cost-weighted entropy-search utility function.
[0046] The foregoing is a non-limiting summary of the invention, which is defined by the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0047] Various aspects and embodiments will be described with reference to the following Figures. It should be appreciated that the Figures are not necessarily drawn to scale. Items appearing in multiple Figures are indicated by the same or a similar reference numeral in all the Figures in which they appear.
[0048] Figure 1 is a block diagram illustrating the configuration of a machine learning system.
[0049] Figures 2A to 2D illustrate iteratively updating a probabilistic model of an objective function, at least in part by using an acquisition utility function, in accordance with some embodiments of the technology described herein.
[0050] Figures 3A and 3B illustrate computing an integrated acquisition utility function, in accordance with some embodiments of the technology described herein.
[0051] Figure 4 is a flowchart of an illustrative process for performing optimization using an objective function, at least in part by using an integrated acquisition utility function and a probabilistic model of the objective function, in accordance with some embodiments of the technology described herein.
[0052] Figures 5A to 5F illustrate applications of two warping functions to two illustrative non-stationary objective functions.
[0053] Figure 6 is a flowchart of an illustrative process for performing optimization using an objective function, at least in part by using multiple computer hardware processors, in accordance with some embodiments of the technology described herein.
[0054] Figure 7 is a flowchart of an illustrative process for performing multi-task optimization, at least in part by using a joint probabilistic model of multiple objective functions corresponding to respective tasks, in accordance with some embodiments of the technology described herein.
[0055] Figure 8 is a block diagram of an illustrative computer system on which embodiments described herein may be implemented.
DETAILED DESCRIPTION
[0056] Conventional techniques for configuring a machine learning system involve manually setting one or more system parameters and automatically setting one or more other system parameters (for example, by learning parameter values using training data). For example, a machine learning system may have one or more parameters, sometimes termed “hyperparameters”, whose values are set manually before the machine learning system is trained (for example, before the values of one or more other parameters of the machine learning system are learned using training data). Hyperparameters may be used during training of the machine learning system (for example, the learning technique used to learn the parameters of the machine learning system may depend on the values of the hyperparameters) and at runtime (for example, the way a trained machine learning system processes new data may depend on the values of the hyperparameters).
[0057] For example, as illustrated in Figure 1, machine learning system 102 may be configured by first setting hyperparameters 104 manually and subsequently learning, during training stage 110, the values of parameters 106a, based on training data 108 and hyperparameters 104, to obtain learned parameter values 106b. The performance of the configured machine learning system 112 may then be assessed during evaluation stage 116, using test data 114 to calculate one or more values providing a measure of performance 118 of the configured machine learning system 112. Performance measure 118 may be a generalization performance measure and/or any other suitable performance measure.
[0058] As a non-limiting example, machine learning system 102 may be a machine learning system for object recognition comprising a multi-layer neural network associated with one or more hyperparameters (for example, one or more learning rates, one or more dropout rates, one or more weight norms, one or more hidden layer sizes, convolutional kernel size when the neural network is a convolutional neural network, pooling size, etc.). These hyperparameters are conventionally set manually before the neural network is trained on training data. As another non-limiting example, machine learning system 102 may be a machine learning system for text processing that uses a latent Dirichlet allocation technique to process text in batches, which technique involves using a directed graphical model associated with various hyperparameters (for example, one or more learning rates, the size of the batches of text to process in each iteration of training the graphical model, etc.). These hyperparameters are conventionally set manually before the directed graphical model is trained on training data. As yet another non-limiting example, machine learning system 102 may be a machine learning system for analyzing protein DNA sequences comprising a support vector machine (for example, a structured support vector machine) associated with one or more hyperparameters (for example, one or more regularization parameters, one or more entropy terms, model convergence tolerance, etc.). These hyperparameters are conventionally set manually before the support vector machine is trained on training data. It should be appreciated that these examples are illustrative and that there are many other examples of machine learning systems having hyperparameters that are conventionally set manually.
[0059] The performance of machine learning systems (for example, generalization performance) is sensitive to hyperparameters, and manually setting the hyperparameters of a machine learning system to “reasonable” values (that is, manually tuning the machine learning system), as is conventionally done, can lead to poor or sub-optimal system performance. Indeed, the difference between bad and good hyperparameter settings may be the difference between a useless machine learning system and one having state-of-the-art performance.
[0060] One conventional approach to setting the hyperparameters of a machine learning system is to try different hyperparameter settings and evaluate the performance of the machine learning system for each such setting. However, such a brute-force search approach is impractical because a machine learning system may have a large number of hyperparameters, so that there are many different settings that would have to be evaluated. Furthermore, evaluating the performance of the machine learning system for each hyperparameter setting can take a long time and/or consume a large amount of computational resources because the machine learning system may need to be retrained for each hyperparameter setting, which is very demanding computationally, as many machine learning systems are trained using very large training data sets (for example, training a machine learning system can take days). As a result, although there may be time and/or computational resources to evaluate a small number of hyperparameter settings, exhaustively testing numerous permutations of possible hyperparameter settings may not be feasible.
[0061] Another conventional approach to setting the hyperparameters of a machine learning system is to use Bayesian optimization techniques. This approach involves treating the problem of setting hyperparameters of a machine learning system as an optimization problem, whose goal is to find the set of hyperparameter values corresponding to the best performance of the machine learning system, and applying an optimization technique to solve this optimization problem. To this end, the relationship between the hyperparameter values of a machine learning system and its performance may be regarded as the objective function for the optimization problem (that is, the objective function maps hyperparameter values of a machine learning system to respective values providing a measure of performance of the machine learning system), and solving the optimization problem involves finding one or more extremal points (for example, local minima, local maxima, global minima, global maxima, etc.) in the domain of the objective function. However, this objective function is not known in closed form (for example, analytically) for any practical machine learning system, whose performance depends not only on the values of its hyperparameters, but also on the training data used to train the machine learning system and on other factors (for example, as shown in Figure 1, performance measure 118 depends not only on hyperparameters 104, but also on training data 108, test data 114, details of training procedure 110, etc.). Furthermore, although the objective function can be evaluated point-wise (for example, for each setting of hyperparameter values of a machine learning system, a value providing a measure of performance of the machine learning system can be obtained), each such evaluation may require a significant amount of time and/or computing power to perform.
[0062] Thus, optimization techniques that require a closed-form analytical representation of the objective function (for example, techniques that require the computation of gradients) and/or a large number of objective function evaluations (for example, interior-point methods) are not viable approaches to identifying the hyperparameter values of machine learning systems. By contrast, Bayesian optimization techniques require neither exact knowledge of the objective function nor a large number of objective function evaluations. Although Bayesian optimization techniques rely on evaluations of the objective function, they are designed to reduce the number of such evaluations.
[0063] Bayesian optimization involves building a probabilistic model of the objective function based on previously obtained evaluations of the objective function, updating the probabilistic model as new evaluations of the objective function become available, and using the probabilistic model to identify extremal points of the objective function (for example, one or more local minima, local maxima, global minima, global maxima, etc.). The probabilistic model, together with a so-called acquisition utility function (examples of which are disclosed in more detail below), is used to make informed decisions about where to evaluate the objective function next, and the new evaluations can be used to update the probabilistic model of the objective function. In this way, the number of objective function evaluations needed to obtain a probabilistic model that accurately represents the objective function with high confidence can be reduced. The higher the fidelity of the probabilistic model to the underlying objective function, the more likely it is that one or more extremal points identified using the probabilistic model correspond to (for example, are good estimates or approximations of) the extremal points of the objective function.
[0064] Accordingly, the conventional Bayesian optimization approach to setting the hyperparameters of a machine learning system involves building a probabilistic model of the relationship between the hyperparameter values of a machine learning system and its performance, and using this probabilistic model together with an acquisition utility function to make informed decisions about which hyperparameter values to try. In this way, the number of times that the performance of a machine learning system must be evaluated for sets of hyperparameter values can be reduced.
[0065] The inventors have recognized that conventional Bayesian optimization techniques, including conventional Bayesian optimization techniques for setting the hyperparameters of machine learning systems, may be improved. The inventors have recognized that one disadvantage of conventional Bayesian optimization techniques is that their performance is highly sensitive to the values of the parameters of the probabilistic model of the objective function (for example, a small change in the parameter values of the probabilistic model can lead to a large change in the overall performance of the Bayesian optimization technique). In particular, the inventors have observed that the acquisition utility function used in Bayesian optimization to identify points at which to evaluate the objective function next (for example, to identify the next set of hyperparameter values for which to evaluate the performance of a machine learning system) is sensitive to the values of the parameters of the probabilistic model of the objective function, which can lead to poor overall performance of the Bayesian optimization technique.
[0066] Accordingly, some embodiments are directed to performing Bayesian optimization using an integrated acquisition utility function obtained by averaging multiple acquisition utility functions, each of which corresponds to different values of the parameters of the probabilistic model (such averaging is sometimes referred to as "integrating out" the parameters of the probabilistic model). The integrated acquisition utility function may be less sensitive to the parameters of the probabilistic model of the objective function, which can improve the robustness and performance of conventional Bayesian optimization techniques.
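The averaging described above may be illustrated with the following sketch, which averages an expected-improvement acquisition over several settings of a kernel parameter of a Gaussian process model. The squared-exponential kernel, the expected-improvement form of the acquisition utility function, and the fixed list of length-scale samples are illustrative assumptions only; a fuller implementation would draw the parameter samples from their posterior distribution (for example, by Markov chain Monte Carlo).

```python
import numpy as np
from math import erf, sqrt

def gp_posterior(X, y, Xs, ell, noise=1e-6):
    """Zero-mean GP posterior mean/variance at Xs, using a squared-exponential
    kernel with length scale ell (an illustrative kernel choice)."""
    k = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    """EI for minimization: E[max(best - f, 0)] under N(mu, var)."""
    sigma = np.sqrt(var)
    z = (best - mu) / sigma
    Phi = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z])
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return sigma * (z * Phi + phi)

X = np.array([0.1, 0.5, 0.9])    # points already evaluated
y = np.array([0.6, 0.2, 0.7])    # observed objective values
grid = np.linspace(0.0, 1.0, 101)

# Average the acquisition over several kernel length scales; these fixed
# samples stand in for posterior samples of the model parameters.
ell_samples = [0.05, 0.1, 0.2, 0.4]
integrated_ei = np.zeros_like(grid)
for ell in ell_samples:
    mu, var = gp_posterior(X, y, grid, ell)
    integrated_ei += expected_improvement(mu, var, y.min())
integrated_ei /= len(ell_samples)
x_next = grid[np.argmax(integrated_ei)]  # point chosen by the integrated acquisition
```

Because the maximizer of the averaged acquisition need not coincide with the maximizer under any single parameter setting, the chosen point is less dependent on one particular, possibly poor, choice of model parameters.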
[0067] The inventors have recognized that another disadvantage of conventional Bayesian optimization techniques, including conventional Bayesian optimization techniques for setting the hyperparameters of machine learning systems, is that conventional Bayesian optimization techniques are sequential techniques, requiring that the next point at which to evaluate the objective function (for example, the next set of hyperparameter values for which to evaluate the performance of a machine learning system) be chosen based on the results of all previous evaluations of the objective function. Therefore, each evaluation of the objective function must be completed before the next point at which to evaluate the objective function can be identified. As such, all objective function evaluations are performed sequentially (that is, one at a time).
[0068] Accordingly, some embodiments are directed to parallelizing Bayesian optimization so that multiple evaluations of the objective function can be performed in parallel (for example, so that different hyperparameter values for a machine learning system can be evaluated concurrently, for example, using different computer hardware processors). In these embodiments, the next point at which to evaluate the objective function can be selected before completion of one or more previously initiated evaluations of the objective function, but the selection can be made based on the respective likelihoods of potential outcomes of the pending evaluations of the objective function, so that some information about the pending evaluations (for example, the specific points at which evaluation is being performed) is taken into account when selecting the next point at which to evaluate the objective function. Parallelization of objective function evaluations can be useful when evaluating the objective function is computationally expensive, as may be the case, for example, when identifying hyperparameter values for machine learning systems that take a long time (for example, days) to train.
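One way to account for a pending evaluation, sketched below, is to draw "fantasy" outcomes for the pending point from the current model's predictive distribution, condition the model on each fantasy, and average the resulting acquisition functions. The Gaussian process with a squared-exponential kernel, the expected-improvement acquisition, and the specific points are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np
from math import erf, sqrt

def gp_posterior(X, y, Xs, ell=0.2, noise=1e-6):
    """Zero-mean GP posterior mean/variance (illustrative squared-exponential kernel)."""
    k = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    sigma = np.sqrt(var)
    z = (best - mu) / sigma
    Phi = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z])
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return sigma * (z * Phi + phi)

rng = np.random.default_rng(0)
X = np.array([0.1, 0.9])        # completed evaluations
y = np.array([0.5, 0.3])
x_pending = np.array([0.5])     # evaluation still running on another worker
grid = np.linspace(0.0, 1.0, 101)

# Monte Carlo over fantasy outcomes of the pending evaluation.
mu_p, var_p = gp_posterior(X, y, x_pending)
avg_ei = np.zeros_like(grid)
n_fantasies = 20
for _ in range(n_fantasies):
    y_fantasy = rng.normal(mu_p, np.sqrt(var_p))   # plausible pending result
    Xf = np.concatenate([X, x_pending])
    yf = np.concatenate([y, y_fantasy])
    mu, var = gp_posterior(Xf, yf, grid)
    avg_ei += expected_improvement(mu, var, yf.min())
avg_ei /= n_fantasies
x_next = grid[np.argmax(avg_ei)]
```

Because every fantasy collapses the predictive variance at the pending point, the averaged acquisition is small there, so the next point chosen for a second worker is steered away from work already in progress.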
[0069] The inventors have recognized that another disadvantage of conventional Bayesian optimization techniques, including conventional Bayesian optimization techniques for setting the hyperparameters of machine learning systems, is that conventional Bayesian optimization techniques use a stationary Gaussian process to model the objective function (for example, using a stationary Gaussian process to model the relationship between the hyperparameter values of a machine learning system and its performance), which may not be an adequate probabilistic model for objective functions that are not stationary. For example, a stationary Gaussian process may not be an adequate model for a non-stationary objective function because the second-order statistics of a stationary Gaussian process are translation invariant (for example, the covariance kernel of the Gaussian process is translation invariant), while the second-order statistics of a non-stationary objective function may not be translation invariant.
[0070] Accordingly, some embodiments are directed to performing Bayesian optimization using a probabilistic model adapted to model non-stationary as well as stationary objective functions more accurately. In some embodiments, the probabilistic model of the objective function may be specified based at least in part on a non-linear one-to-one mapping of elements in the domain of the objective function. In embodiments where the probabilistic model comprises a Gaussian process, the covariance kernel of the Gaussian process may be specified at least in part using the non-linear one-to-one mapping.
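A brief sketch of this idea follows: a monotone warping of the input space composed with a stationary kernel yields a covariance that is no longer translation invariant. The Kumaraswamy CDF used as the one-to-one mapping, the squared-exponential base kernel, and the specific parameter values are illustrative assumptions.

```python
import numpy as np

def kumaraswamy_cdf(x, a, b):
    """Monotone one-to-one warping of [0, 1] onto itself; a and b control
    where resolution is concentrated (an illustrative closed-form warping)."""
    return 1.0 - (1.0 - x**a) ** b

def warped_kernel(x1, x2, a=2.0, b=0.5, ell=0.2):
    """Stationary squared-exponential kernel applied to warped inputs; the
    induced covariance in the original space is non-stationary."""
    w1 = kumaraswamy_cdf(x1, a, b)
    w2 = kumaraswamy_cdf(x2, a, b)
    return np.exp(-0.5 * (w1 - w2) ** 2 / ell**2)

# Two pairs of points the same distance apart in the original space receive
# different covariances, demonstrating the loss of translation invariance.
c_left = warped_kernel(0.1, 0.2)
c_right = warped_kernel(0.8, 0.9)
```

Here the warping stretches the right end of the interval, so points near 0.9 decorrelate faster than equally spaced points near 0.1, which is the kind of behavior a stationary kernel cannot express.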
[0071] The inventors have recognized that another disadvantage of conventional Bayesian optimization techniques is that, when applied to solving a particular optimization task, they cannot take advantage of information obtained during past applications of these same techniques to a related optimization task. For example, a machine learning system (for example, a neural network for identifying objects in a set of images) may be applied to different data sets (for example, different sets of images), but conventional Bayesian optimization techniques require identifying the hyperparameters of the machine learning system anew for each data set (for example, for each set of images). None of the information previously obtained while identifying the hyperparameters for a machine learning system using one data set (for example, which hyperparameter values make the machine learning system perform well and which hyperparameter values make the machine learning system perform poorly) can be used to identify hyperparameter values for the same machine learning system using another data set.
[0072] Accordingly, some embodiments are directed to Bayesian optimization techniques that, when applied to solving a particular optimization task, can take advantage of information obtained while solving one or more other related optimization tasks. For example, in some embodiments, information obtained while setting the hyperparameters for a machine learning system using a first data set may be applied to setting the hyperparameters of the machine learning system using a second data set different from the first data set. In this way, the previously obtained information can be used to set hyperparameters for the machine learning system more efficiently (for example, using fewer objective function evaluations, which may be computationally expensive to perform). More generally, optimization for several different optimization tasks can be performed more efficiently because information obtained by solving one of the optimization tasks can be used to solve another of the optimization tasks.
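One common way to share information across tasks, sketched below, is a multi-task Gaussian process whose kernel is the product of a between-task similarity and a between-input similarity, so that evaluations gathered on one task inform predictions for a related task. The product-kernel form, the fixed task-similarity matrix (in practice learned from data), the squared-exponential input kernel, and the toy objective are all illustrative assumptions.

```python
import numpy as np

def multitask_kernel(X1, t1, X2, t2, K_task, ell=0.2):
    """Product kernel: task correlation K_task[t, t'] times input correlation."""
    k_x = np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ell**2)
    k_t = K_task[np.ix_(t1, t2)]
    return k_t * k_x

# Assumed task-similarity matrix for two related tasks; a full system would
# infer these correlations from observed evaluations.
K_task = np.array([[1.0, 0.9],
                   [0.9, 1.0]])

# Several evaluations completed on task 0, none yet on task 1.
X = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
t = np.zeros(len(X), dtype=int)
y = (X - 0.35) ** 2              # toy objective values observed for task 0

# Predict task 1's objective at the same inputs, borrowing task 0's data.
Xs = X.copy()
ts = np.ones(len(Xs), dtype=int)
K = multitask_kernel(X, t, X, t, K_task) + 1e-6 * np.eye(len(X))
Ks = multitask_kernel(X, t, Xs, ts, K_task)
mu_task1 = Ks.T @ np.linalg.solve(K, y)
```

Even before any evaluation on the second task, the predicted curve for task 1 already dips near the first task's minimizer, so a Bayesian optimization run on task 1 can start from an informed model rather than from scratch.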
[0073] Some embodiments of the technology described in this document address some of the disadvantages, discussed above, of conventional Bayesian optimization techniques, including Bayesian optimization techniques for setting hyperparameters of machine learning systems. However, not every embodiment addresses all of these disadvantages, and some embodiments may not address any of them. As such, it should be noted that the aspects of the technology described in this document are not limited to addressing all or any of the disadvantages, discussed above, of conventional Bayesian optimization techniques.
[0074] It should also be noted that the embodiments described in this document may be implemented in any of numerous ways. Examples of specific implementations are provided below for illustrative purposes only. It should be noted that these embodiments and the features/capabilities provided may be used individually, together, or in any combination of two or more, as the aspects of the technology described in this document are not limited in this respect.
[0075] In some embodiments, Bayesian optimization techniques involve building a probabilistic model of an objective function based on one or more previously obtained evaluations of the objective function, and updating the probabilistic model based on any new evaluations of the objective function that become available. Thus, in some embodiments, optimization using an objective function may be performed iteratively (for one or more iterations) by performing, at each iteration, acts of: identifying a point at which to evaluate the objective function using an acquisition utility function and a probabilistic model of the objective function; evaluating the objective function at the identified point; and updating the probabilistic model based on the results of the evaluation. The Bayesian optimization techniques described in this document may be applied to any of numerous types of objective functions that arise in different applications.
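The iterative procedure just described can be sketched in a few lines. The Gaussian process with a squared-exponential kernel, the expected-improvement acquisition utility function, and the toy quadratic objective standing in for an expensive hyperparameter-to-performance mapping are illustrative assumptions only.

```python
import numpy as np
from math import erf, sqrt

def sq_exp_kernel(A, B, ell=0.2):
    """Illustrative squared-exponential kernel between 1-D point sets."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at test points Xs."""
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    Ks = sq_exp_kernel(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)  # prior var is 1
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    """Acquisition utility for minimization: E[max(best - f, 0)]."""
    sigma = np.sqrt(var)
    z = (best - mu) / sigma
    Phi = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z])
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return sigma * (z * Phi + phi)

def objective(x):
    """Toy stand-in for an expensive objective, e.g. validation error."""
    return (x - 0.3) ** 2

grid = np.linspace(0.0, 1.0, 201)
X = np.array([0.0, 1.0])                 # two initial evaluations
y = objective(X)
for _ in range(10):
    mu, var = gp_posterior(X, y, grid)   # current probabilistic model
    ei = expected_improvement(mu, var, y.min())
    x_next = grid[np.argmax(ei)]         # point chosen by the acquisition
    X = np.append(X, x_next)             # evaluate the objective there ...
    y = np.append(y, objective(x_next))  # ... and update the model's data
best_x = X[np.argmin(y)]
```

After a handful of iterations the best evaluated point lands near the true minimizer at 0.3, using far fewer evaluations than an exhaustive sweep of the grid would require.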
[0076] As described above, one non-limiting example of an objective function to which the Bayesian optimization techniques described in this document may be applied is an objective function that relates values of one or more hyperparameters of a machine learning system to respective values that provide a measure of performance of the machine learning system configured with those hyperparameter values (for example, a machine learning system trained at least in part using those hyperparameters and/or processing new data at least in part using those hyperparameters). A non-limiting example of such a machine learning system is a machine learning system for recognizing objects in images that uses a neural network (for example, a multi-layer neural network, a convolutional neural network, a feed-forward neural network, a recurrent neural network, a radial basis function neural network, etc.) and/or any other machine learning technique suitable for recognizing objects in images. Examples of hyperparameters for such a machine learning system were provided above. Another non-limiting example of such a machine learning system is a machine learning system for processing natural language text (for example, identifying one or more topics in text, text mining, etc.) that uses latent Dirichlet allocation (LDA), probabilistic latent semantic analysis, hierarchical LDA, non-parametric LDA, and/or any other machine learning technique suitable for processing natural language text. Such machine learning systems may be adapted to process large collections (for example, one or more corpora) of natural language text. Examples of hyperparameters for such a machine learning system were provided above. Another non-limiting example of such a machine learning system is a machine learning system for analyzing biological data (for example, a machine learning system for predicting protein motifs) that uses a support vector machine (for example, a linear support vector machine, a latent structured support vector machine, any suitable maximum-margin classifier, etc.) and/or any other machine learning technique suitable for processing biological data. Other non-limiting examples of machine learning systems to which the Bayesian optimization techniques described in this document may be applied (to set the hyperparameters of the machine learning system) include, but are not limited to, machine learning systems for processing medical images (for example, machine learning systems for identifying anomalous objects in medical images, such as objects attributable to and/or potentially indicating the presence of disease), machine learning systems for processing ultrasound data, machine learning systems for modeling data of any suitable type using non-linear adaptive basis function regression, machine learning systems for processing radar data, machine learning systems for processing speech (for example, speech recognition, speaker identification, speaker diarization, natural language understanding, etc.), and machine learning systems for machine translation.
[0077] It should be noted that the Bayesian optimization techniques described in this document are not limited to being applied to setting the hyperparameter values of machine learning systems and, in some embodiments, may be applied to other problems. As a non-limiting example, the Bayesian optimization techniques described in this document may be applied to an objective function that relates parameters of an image and/or video compression algorithm (for example, one or more parameters specified by one or more of the JPEG compression standards, one or more parameters specified by one or more of the MPEG standards, etc.) to a measure of performance of the image and/or video compression algorithm. As another non-limiting example, the Bayesian optimization techniques described in this document may be applied to an objective function that relates parameters of a computer vision system (for example, a computer vision system for object recognition, pose estimation, tracking of people and/or objects, optical flow, scene reconstruction, etc.) to a measure of performance of the computer vision system. As another non-limiting example, the Bayesian optimization techniques described in this document may be applied to an objective function that relates parameters of a non-linear control system (for example, a control system for controlling one or more robots) to the performance of the control system. As another non-limiting example, the Bayesian optimization techniques described in this document may be applied to an objective function that relates parameters at least partially characterizing a structure being designed (for example, parameters at least partially characterizing an airplane wing) to the performance of the structure (for example, whether the airplane wing has adequate desired lift characteristics). The above examples are not exhaustive and, more generally, the Bayesian optimization techniques described in this document may be applied to any objective function that may be computationally expensive to evaluate and/or any other objective function that arises in any suitable optimization problem, as the Bayesian optimization techniques described in this document are not limited by the type of objective function to which they may be applied.
[0078] As described above, in some embodiments, the Bayesian optimization techniques described in this document involve generating a probabilistic model of an objective function for a particular task (for example, an objective function that relates hyperparameters of a machine learning system to its performance). Any suitable type of probabilistic model of the objective function may be used. In some embodiments, the probabilistic model may comprise a Gaussian process, which is a stochastic process that specifies a distribution over functions. A Gaussian process may be specified by a mean function $m: \mathcal{X} \to \mathbb{R}$ and a covariance function (sometimes termed a "kernel" function) $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. For example, when the objective function relates hyperparameters of a machine learning system to its performance, the Gaussian process is defined on the space of hyperparameters, such that the mean function maps sets of hyperparameter values (each set of hyperparameter values corresponding to values of one or more hyperparameters of the machine learning system) to the real numbers, and the covariance function represents the correlation between sets of hyperparameter values.
[0079] The covariance function may be specified at least in part by a kernel, and any of numerous types of kernels may be used. In some embodiments, a Matérn kernel may be used. As a non-limiting example, a Matérn 5/2 kernel ($K_{M52}$) may be used, which kernel may be defined according to:

$$K_{M52}(x, x') = \theta_0 \left(1 + \sqrt{5\, r^2(x, x')} + \tfrac{5}{3}\, r^2(x, x')\right) \exp\left(-\sqrt{5\, r^2(x, x')}\right), \qquad r^2(x, x') = \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\theta_d^2}, \qquad (1)$$

where $\theta_0$ and $\theta_1, \ldots, \theta_D$ are kernel parameters, and where $x$ and $x'$ are points in the domain on which the Gaussian process is defined (for example, $x$ and $x'$ may represent sets of hyperparameter values for a machine learning system). The Matérn 5/2 kernel may be preferable to other kernel choices because the induced Gaussian process has favorable properties (for example, the sample paths of the Gaussian process are twice differentiable). However, a Gaussian process specified using other kernels may be used. Examples of kernels that may be used include, but are not limited to, an automatic relevance determination squared exponential kernel, a rational quadratic kernel, a periodic kernel, a locally periodic kernel, a linear kernel, and a kernel obtained by combining (for example, by multiplication, addition, etc.) any of the above-mentioned kernels.

[0080] A probabilistic model of an objective function that comprises a Gaussian process may be used to calculate an estimate of the objective function by computing the predicted mean of the Gaussian process given all previously obtained evaluations of the objective function. The uncertainty associated with this estimate may be calculated by computing the predicted covariance of the Gaussian process given all previously obtained evaluations of the objective function. For example, the predicted mean and covariance for a Gaussian process over functions $f: \mathcal{X} \to \mathbb{R}$, given the $N$ previously obtained evaluations $\{y_n;\ 1 \le n \le N\}$ of an objective function at the set of points $X = \{x_n;\ 1 \le n \le N\}$, can be expressed as:

$$\mu(x; \{x_n, y_n\}, \theta) = K(X, x)^T K(X, X)^{-1} (y - m(X)) \qquad (2)$$
$$\Sigma(x, x'; \{x_n, y_n\}, \theta) = K(x, x') - K(X, x)^T K(X, X)^{-1} K(X, x') \qquad (3)$$

where $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the kernel of the Gaussian process, $K(X, x)$ is the $N$-dimensional column vector of cross-covariances between $x$ and the set $X$, $K(X, X)$ is the Gram matrix for the set $X$, $y$ is the $N$-by-1 vector of evaluations, $m(X)$ is the vector of means of the Gaussian process at the points in the set $X$, and $\theta$ is the set of one or more other parameters of the Gaussian process (for example, kernel parameters).
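The kernel of equation (1) and the predictive equations (2) and (3) can be sketched directly in code. A single shared length scale stands in for the per-dimension parameters $\theta_d$, the mean function is taken to be zero, and a small jitter term is added for numerical stability; all of these are illustrative simplifications.

```python
import numpy as np

def matern52(A, B, theta0=1.0, ell=0.3):
    """Matern 5/2 kernel of equation (1) for 1-D point sets, with one shared
    length scale ell standing in for the per-dimension parameters theta_d."""
    r2 = (A[:, None] - B[None, :]) ** 2 / ell**2
    s = np.sqrt(5.0 * r2)
    return theta0 * (1.0 + s + 5.0 * r2 / 3.0) * np.exp(-s)

def gp_predict(X, y, Xs, noise=1e-6):
    """Predicted mean and covariance of equations (2) and (3), conditioned on
    evaluations y at points X, assuming a zero mean function m = 0."""
    K = matern52(X, X) + noise * np.eye(len(X))   # Gram matrix K(X, X)
    Ks = matern52(X, Xs)                          # cross-covariances K(X, x)
    Kss = matern52(Xs, Xs)                        # prior covariance K(x, x')
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, cov

X = np.array([0.1, 0.5, 0.9])    # points where the objective was evaluated
y = np.array([0.4, 0.1, 0.6])    # observed objective values
Xs = np.linspace(0.0, 1.0, 5)    # points at which to predict
mu, cov = gp_predict(X, y, Xs)
```

At an already-evaluated point the predicted mean reproduces the observation and the predicted variance is nearly zero, while at points away from the data the predicted variance grows, exactly the behavior Figures 2A to 2D illustrate.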
[0081] It should be noted that the probabilistic model of an objective function is not limited to comprising a Gaussian process model. As a non-limiting example, the probabilistic model of an objective function may comprise a neural network whose weights are random variables, such that the neural network specifies a distribution over a set of functions. The neural network may be a convolutional neural network, a deep neural network, and/or any other suitable type of neural network. As another non-limiting example, the probabilistic model of an objective function may comprise an adaptive basis function regression model.

[0082] As a non-limiting example, in such embodiments, the probabilistic model may comprise a Bayesian linear regression model specified as a linear combination of $D$ non-linear basis functions $\{\phi_d(x);\ 1 \le d \le D\}$, where $D$ is an integer greater than or equal to one. The non-linear basis functions may be obtained at least in part using a multi-layer neural network. For example, in some embodiments, the non-linear basis functions may be obtained by training a multi-layer neural network (for example, using suitable training techniques) and using the projection of inputs onto the last hidden layer of the multi-layer neural network as the non-linear basis functions. The last hidden layer can then be used as a feature representation for the Bayesian linear regression model, which can be expressed as follows.

[0083] Let $\Phi$ denote the $D \times N$ matrix that results from concatenating the basis function values $\{\phi(x_n);\ 1 \le n \le N\}$ obtained by projecting the $N$ inputs $\{x_n;\ 1 \le n \le N\}$ onto the final hidden layer of a multi-layer neural network. The regression model for the observations $y$ given the inputs $X = \{x_n\}_{n=1}^{N}$ places a Gaussian prior, with scale hyperparameter $\theta_\alpha$, on the weights $w$ of the linear output layer, so that $y \mid X, w \sim \mathcal{N}(\Phi^T w, \sigma^2 I)$ with $w \sim \mathcal{N}(0, \theta_\alpha I)$. The predictive distribution for the output $y$ corresponding to an input $x$ is then Gaussian and can be expressed as

$$y \mid x, X, y \sim \mathcal{N}\!\left(\mu(x), \sigma^2(x)\right), \qquad \mu(x) = \frac{1}{\sigma^2}\, \phi(x)^T A^{-1} \Phi\, y, \qquad \sigma^2(x) = \sigma^2 + \phi(x)^T A^{-1} \phi(x),$$

in which $A = \frac{1}{\sigma^2} \Phi \Phi^T + \frac{1}{\theta_\alpha} I$ is the posterior precision matrix induced by the $N$ input points under the scale hyperparameter $\theta_\alpha$.
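The adaptive-basis-regression model above can be sketched as follows. A fixed random tanh projection stands in for the trained network's last hidden layer, and the objective values, the prior scale, and the noise level are illustrative assumptions; in a full system the projection would come from a network trained on the available evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_features(x, W, b):
    """Stand-in for the last hidden layer of a trained network: a fixed
    random tanh projection provides the non-linear basis functions phi(x)."""
    return np.tanh(W @ np.atleast_1d(x) + b)

D, d = 8, 1                      # number of basis functions, input dimension
W = rng.normal(size=(D, d)) * 3.0
b = rng.normal(size=D)

def blr_fit_predict(X, y, Xs, theta_alpha=1.0, sigma2=1e-2):
    """Bayesian linear regression on the features: posterior precision A and
    weight mean, then predictive mean/variance at test inputs Xs."""
    Phi = np.column_stack([hidden_features(x, W, b) for x in X])   # D x N
    A = Phi @ Phi.T / sigma2 + np.eye(D) / theta_alpha             # precision
    mean_w = np.linalg.solve(A, Phi @ y) / sigma2                  # A^-1 Phi y / sigma^2
    mu, var = [], []
    for x in Xs:
        phi = hidden_features(x, W, b)
        mu.append(phi @ mean_w)                                    # mu(x)
        var.append(sigma2 + phi @ np.linalg.solve(A, phi))         # sigma^2(x)
    return np.array(mu), np.array(var)

X = np.linspace(-1.0, 1.0, 20)
y = np.sin(3 * X)                # toy objective evaluations
mu, var = blr_fit_predict(X, y, X)
```

Because only the $D \times D$ precision matrix must be factorized, the cost of prediction scales with the number of basis functions rather than cubically with the number of evaluations, which is one motivation for this alternative to a full Gaussian process.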
[0084] Regardless of the type of probabilistic model used to model the objective function, the probabilistic model may be used to obtain an estimate of the objective function and a measure of uncertainty associated with that estimate. For example, when the objective function relates hyperparameter values of a machine learning system to its performance, the estimate of the objective function obtained based on the probabilistic model may provide an estimate of the performance of the machine learning system for each set of hyperparameter values, and the measure of uncertainty associated with the estimate may provide a measure of uncertainty (for example, a variance, a confidence, etc.) associated with the estimate of how well the machine learning system performs for a particular set of hyperparameter values. Different amounts of uncertainty may be associated with the estimates of performance of the machine learning system corresponding to different hyperparameter values. For some hyperparameter values, the probabilistic model may be able to provide a high-confidence estimate (for example, an estimate associated with a low variance) of the performance of the machine learning system when configured with those hyperparameter values, while for other hyperparameter values the probabilistic model may provide only a low-confidence estimate (for example, an estimate associated with a high variance) of the performance of the machine learning system when configured with those hyperparameter values.
[0085] The probabilistic model of an objective function may be used to obtain an estimate of the objective function in any of numerous ways. As a non-limiting example, the probabilistic model may be used to calculate an estimate of the objective function by calculating the predicted mean of the objective function under the probabilistic model given all previous observations (that is, evaluations) of the objective function, and to calculate the associated measure of uncertainty as the predicted covariance. Such calculations may be performed for any of numerous types of probabilistic models, including Gaussian processes (for example, according to the equations provided above), adaptive basis function regression models (of which the neural network models are an example), and any other suitable models.
[0086] As can be seen from the examples above, in some embodiments, the probabilistic model of an objective function may specify a probability distribution over a set of functions (for example, a set of functions believed to include the objective function or another function that closely approximates the objective function). This probability distribution may specify a probability value for each of one or more functions in the set of functions, with the probability value for a particular function indicating the probability that that function is the objective function. For example, a Gaussian process may be considered to induce a distribution over the set of functions on the space in which the Gaussian process is defined. As such, a Gaussian process may be used to specify a distribution over the set of all possible objective functions (for example, the set of all objective functions that relate hyperparameter values of a machine learning system to the corresponding performance of the machine learning system).
[0087] In some embodiments, a probabilistic model of an objective function may be updated based on new information obtained about the objective function. The updated distribution may be more concentrated than the initial distribution and, as such, may provide a lower-uncertainty representation of the objective function. The updated distribution may be used to compute various estimates of the objective function. As discussed above, an objective function may not be known in closed form, and information about the objective function may be obtained through pointwise evaluation of the objective function. For example, information about an objective function that relates hyperparameters of a machine learning system to its performance may be obtained by evaluating the performance of the machine learning system for each of one or more settings of the hyperparameters. Thus, in some embodiments, a probabilistic model of an objective function may be updated based on one or more evaluations of the objective function, to reflect the additional information learned about the objective function through the new evaluation(s). For example, in embodiments where the probabilistic model of an objective function comprises a Gaussian process, the Gaussian process may be updated (for example, its mean and/or covariance function may be updated) based on the new evaluation(s) of the objective function. As another example, in embodiments where the probabilistic model of an objective function comprises a neural network, the neural network may be updated (for example, the probability distributions associated with the weights of the neural network may be updated) based on the new evaluation(s) of the objective function.
[0088] An illustrative, non-limiting example of updating a probabilistic model of an objective function based on one or more evaluations of the objective function is shown in Figures 2A to 2D. Figure 2A illustrates a probabilistic model of objective function 200 generated based on three previously obtained evaluations of the objective function at three points, yielding the respective objective function values 202, 204 and 206. In the illustrated example, the probabilistic model comprises a Gaussian process that was used to calculate an estimate 205 of the objective function, by computing the predicted mean of the Gaussian process conditioned on the three previous evaluations of the objective function, and a measure of uncertainty associated with the estimate 205, by computing the predicted covariance (a variance, in this one-dimensional example) conditioned on the three previous evaluations of the objective function. The measure of uncertainty is illustrated in Figure 2A by the shaded region shown between curves 207 and 209. It can be seen from Figure 2A that the probabilistic model is more uncertain about the objective function in regions where the objective function has not been evaluated, and less uncertain around the regions where the objective function has been evaluated (for example, the region of uncertainty shrinks closest to evaluations 202, 204 and 206). That is, the uncertainty associated with the estimate of the objective function is greater in the regions where the objective function has not been evaluated (for example, the predicted variance of the Gaussian process is greater in regions where the objective function has not been evaluated; the predicted variance is 0 at points where the objective function has been evaluated, since the value of the objective function at those points is known exactly).
[0089] Figure 2B illustrates the probabilistic model of objective function 200 after the probabilistic model has been updated based on an additional evaluation of objective function 200 at a new point, yielding the respective objective function value 208. The updated probabilistic model can be used to calculate an updated estimate 210 of the objective function 200, by computing the predicted mean of the Gaussian process conditioned on the four previous evaluations of the objective function, and a measure of uncertainty associated with the estimate 210, by computing the predicted covariance based on the four previous evaluations of the objective function. The measure of uncertainty is illustrated in Figure 2B by the shaded region shown between curves 211 and 213. As can be seen from Figure 2B, the changes in the probabilistic model are most pronounced around the region of the new evaluation: estimate 210 passes through value 208 (in contrast to estimate 205 shown in Figure 2A) and the uncertainty associated with the estimate shrinks in the region of value 208. In this way, the probabilistic model represents the objective function 200 with higher fidelity in the region surrounding evaluation value 208 than it did before the additional evaluation of the objective function.
[0090] Figure 2C illustrates the probabilistic model of objective function 200 after the probabilistic model has been updated based on an additional evaluation of objective function 200 at a new point, yielding the respective objective function value 214. The updated probabilistic model can be used to calculate an updated estimate 215 of the objective function 200, by computing the predicted mean of the Gaussian process conditioned on the five previous evaluations of the objective function, and a measure of uncertainty associated with the estimate 215, by computing the predicted covariance based on the five previous evaluations of the objective function. The measure of uncertainty is illustrated in Figure 2C by the shaded region shown between curves 216 and 217. As can be seen from Figure 2C, the changes in the probabilistic model are most pronounced around the region of the new evaluation: estimate 215 passes through value 214 (in contrast to estimates 205 and 210 shown in Figures 2A and 2B, respectively) and the uncertainty associated with the estimate shrinks in the region of value 214. In this way, the probabilistic model represents the objective function 200 with higher fidelity in the region surrounding evaluation value 214 than it did before the additional evaluation of the objective function.
[0091] Figure 2D illustrates the probabilistic model of objective function 200 after the probabilistic model has been updated based on several additional evaluations of objective function 200. The updated probabilistic model can be used to calculate an updated estimate 220 of objective function 200 and an associated measure of uncertainty based on all previous evaluations of the objective function. The measure of uncertainty is illustrated in Figure 2D by the shaded region shown between curves 220 and 221. As can be seen from Figure 2D, the probabilistic model represents objective function 200 with greater fidelity as a result of the additional information about the objective function obtained during the additional evaluations.
[0092] It should be noted that the examples shown in Figures 2A to 2D are merely illustrative and not limiting, as the entire objective function may not be known in practice; only point evaluations may be available. The entire objective function 200 is shown here to help illustrate how additional evaluations of the objective function can be used to update the probabilistic model of the objective function. It should also be noted that although the illustrative objective function 200 is one-dimensional in the examples of Figures 2A to 2D, this is not a limitation of the technology described in this document. An objective function can be defined on a domain of any suitable dimension d (for example, d is at least two, d is at least three, d is at least five, d is at least 10, d is at least 25, d is at least 50, d is at least 100, d is at least 500, d is at least 1000, d is between 10 and 100, d is between 25 and 500, d is between 500 and 5000, etc.). For example, an objective function that represents the relationship between hyperparameter values of a machine learning system and values indicative of the performance of the machine learning system configured with those hyperparameter values can be defined on a domain whose dimensionality is equal to the number of hyperparameters used to configure the machine learning system.
[0093] As illustrated above, a probabilistic model of an objective function can be updated based on one or more evaluations of the objective function. Although the probabilistic model can be updated based on evaluating the objective function at any point(s), evaluating the objective function at some points can provide more information about the objective function and/or the extreme points of the objective function than evaluating it at other points. As one example, the objective function can be evaluated at one or more points that provide information about regions of the objective function that have not been sufficiently explored (for example, points far from the points at which the objective function has been evaluated, points at which the probabilistic model is most uncertain about the objective function, etc.). As another example, the objective function can be evaluated at one or more points that provide information about regions of the objective function believed to contain an extreme point (for example, a local minimum, a local maximum, a global minimum, a global maximum, etc.), which information can be useful in solving the underlying optimization problem.
[0094] As a non-limiting example, evaluating an objective function that relates hyperparameters of a machine learning system (for example, a machine learning system comprising one or more neural networks for performing object recognition) to the performance of the machine learning system when configured with those hyperparameters can, at some points (for some machine learning system hyperparameter values), provide more information about the objective function and/or the extreme points of the objective function than at other points. Evaluating the performance of the machine learning system for some hyperparameter values can provide information about regions of the objective function that have not been sufficiently explored. For example, evaluating the performance of the machine learning system (an objective function evaluation) at hyperparameter values that are distant, according to a suitable distance metric, from hyperparameter values for which the performance of the machine learning system has already been evaluated can provide information about regions of the objective function not previously explored (for example, related to a global exploration of the hyperparameter value space). As another example, the performance of the machine learning system can be evaluated for hyperparameter values at which the performance estimate provided by the probabilistic model of the objective function is associated with a high variance, such that there is uncertainty (for example, at least a threshold amount of uncertainty) associated with the probabilistic model's belief about how well the machine learning system would perform for a given set of hyperparameter values. As another example, evaluating the performance of a machine learning system for hyperparameter values close to the hyperparameter values for which the performance of the machine learning system is believed to be good (for example, the best performance for any hyperparameter values previously tried) can lead to the discovery of hyperparameter values for which the performance of the machine learning system is even better (for example, related to a local exploration of the hyperparameter value space).
[0095] Thus, in some embodiments, given a probabilistic model of an objective function estimated based on one or more previously completed evaluations of the objective function, an informed decision can be made about which point(s) to evaluate the objective function at next. This decision can balance the goals of global exploration (for example, exploration of regions of the objective function where there are few evaluations and/or where the uncertainty associated with the objective function estimates provided by the probabilistic model may be high) and local exploration (for example, exploration of regions of the objective function close to one or more local/global maxima and/or minima).
[0096] In some embodiments, the next point(s) at which to evaluate the objective function can be selected using an acquisition utility function that associates each of one or more points at which the objective function can be evaluated with a value that represents the utility of evaluating the objective function at that point. For example, when the objective function relates hyperparameter values of a machine learning system to its performance, the acquisition utility function can associate each set of hyperparameter values with a value that represents the utility of evaluating the performance of the machine learning system for that set of hyperparameter values.
[0097] An acquisition utility function can be used in any suitable way to select the next point to be evaluated. In some embodiments, the next point at which to evaluate the objective function can be selected as the point that maximizes the acquisition utility function (or minimizes it, depending on how the acquisition utility function is defined). Any suitable acquisition utility function can be used, and it can express any of numerous types of utility measures (including utility measures that appropriately balance the types of local and global exploration described above).
[0098] In some embodiments, the acquisition utility function may depend on the probabilistic model of the objective function. The acquisition utility function can be specified based on current information about the objective function captured by the probabilistic model. For example, the acquisition utility function can be specified based at least in part on an estimate of the objective function that can be obtained from the probabilistic model (for example, the predicted mean), the measure of uncertainty associated with the estimate (for example, the predicted covariance), and/or any other suitable information obtained from the probabilistic model.
[0099] Figures 2A to 2D illustrate the use of an acquisition utility function to select points at which to evaluate the objective function based at least in part on the probabilistic model of the objective function. The acquisition utility function selects points for evaluation by balancing two goals: global exploration (whereby points for evaluation are selected to reduce the uncertainty in the probabilistic model of the objective function) and local exploration (whereby points for evaluation are selected to explore regions of the objective function believed to contain at least one extreme point of the objective function). For example, as shown in Figure 2A, the probabilistic model of objective function 200 can be used to calculate the estimate 205 of the objective function and an associated measure of uncertainty, shown by the shaded region between curves 207 and 209. The values of the acquisition utility function 231, calculated based on estimate 205 and the associated measure of uncertainty, are shown in the lower portion of Figure 2A. As shown, the acquisition utility function 231 assumes higher values in regions where the uncertainty associated with estimate 205 is greater (for example, between values 202 and 204, and between values 204 and 206) and lower values in regions where the uncertainty associated with estimate 205 is smaller (for example, around the values 202, 204 and 206). The next point at which the objective function is evaluated is selected as the point at which the acquisition utility function 231 assumes its maximum value (that is, value 230), and the probabilistic model of the objective function is updated based on the evaluation of the objective function at the selected point.
[00100] Since the acquisition utility function depends on the probabilistic model, after the probabilistic model of objective function 200 is updated, so too is the acquisition utility function. The updated acquisition utility function 233 is calculated based on estimate 210 and the associated measure of uncertainty, and is shown in the lower portion of Figure 2B. As can be seen, the acquisition utility function 233 assumes higher values in regions where the uncertainty associated with estimate 210 is greater (for example, between values 204 and 206) and lower values in regions where the uncertainty associated with estimate 210 is smaller (for example, around the values 202, 204, 206 and 208). The next point at which the objective function is evaluated is selected as the point at which the acquisition utility function 233 assumes its maximum value (that is, value 232), and the probabilistic model of the objective function is updated based on the evaluation of the objective function at the selected point.
[00101] Figure 2C illustrates the updated acquisition utility function 235, which is calculated based on estimate 215 and its associated uncertainty measure. Similar to the examples shown in Figures 2A and 2B, the acquisition utility function 235 assumes higher values in regions where the uncertainty associated with estimate 215 is greater. The next point at which the objective function is evaluated is selected as the point at which the acquisition utility function 235 assumes its maximum value (that is, value 234).
[00102] Figure 2D illustrates the updated acquisition utility function 237, which is calculated based on estimate 220 and its associated measure of uncertainty. In this example, the acquisition utility function 237 does not assume higher values in the regions where the uncertainty associated with estimate 220 is greatest. Instead, function 237 assumes higher values near the point at which the probabilistic model of the objective function indicates that the objective function is likely to have a local and/or global minimum (value 225). Although there are regions of uncertainty associated with estimate 220, none is large enough to capture points where the value of the objective function is less than the value 225. Since the goal, in this example, is to identify a minimum value of the objective function, there is little additional value in exploring the regions of uncertainty associated with estimate 220, as it would be very unlikely to find points in those regions where the objective function assumes values less than the value 225. Instead, the acquisition utility function indicates that it would be more useful to evaluate the objective function around the point at which the objective function likely assumes its lowest values, so that a point at which the objective function assumes a value even lower than the value 225 can be identified.
[00103] In some embodiments, an acquisition utility function may depend on one or more parameters of the probabilistic model (denoted by θ) used to model the objective function, the previous points at which the objective function was evaluated (denoted by {x_n, 1 ≤ n ≤ N}), and the results of those evaluations (denoted by {y_n, 1 ≤ n ≤ N}). Such an acquisition function and its dependencies can be denoted by a(x; {x_n, y_n}; θ). A non-limiting example of an acquisition utility function that depends on one or more parameters of the probabilistic model is the probability of improvement acquisition utility function. The probability of improvement acquisition utility function aims at selecting the next point at which to evaluate the objective function so as to maximize the probability that the evaluation of the objective function will provide an improvement over the best current value of the objective function (for example, selecting the next set of hyperparameter values at which to evaluate the performance of a machine learning system so as to maximize the probability that evaluating the performance of the machine learning system with those hyperparameter values will lead to better performance of the machine learning system than for any hyperparameter values previously tried). When the probabilistic model of the objective function comprises a Gaussian process, the probability of improvement acquisition utility function a_PI can be expressed as:

a_PI(x; {x_n, y_n}, θ) = Φ(γ(x)),  γ(x) = (f(x_best) − μ(x; {x_n, y_n}, θ)) / σ(x; {x_n, y_n}, θ)    (5)

where Φ(·) is the cumulative distribution function of the standard normal random variable, f(x_best) is the best current value of the objective function, and μ(x; {x_n, y_n}, θ) and σ²(x; {x_n, y_n}, θ) denote the predicted mean and predicted variance of the Gaussian process, respectively.
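Equation 5 can be evaluated directly from the predicted mean and standard deviation of the Gaussian process at a candidate point. Below is a minimal sketch, assuming the minimization convention implied by the examples above (the best current value is the lowest seen); the function names and numeric inputs are illustrative, not taken from the patent.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal, via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best):
    """Equation 5: a_PI = Phi(gamma), gamma = (f(x_best) - mu(x)) / sigma(x)."""
    gamma = (f_best - mu) / sigma
    return normal_cdf(gamma)

# A candidate whose predicted mean lies well below the best current value is
# very likely to improve on it.
p = probability_of_improvement(mu=-1.0, sigma=0.5, f_best=0.0)  # Phi(2) ~ 0.977
```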
[00104] Another non-limiting example of an acquisition utility function that depends on one or more parameters of the probabilistic model is the expected improvement acquisition utility function. The expected improvement acquisition utility function aims at selecting the next point at which to evaluate the objective function so as to maximize the expected improvement over the best current value of the objective function. When the probabilistic model of the objective function comprises a Gaussian process, the expected improvement acquisition utility function can be expressed as:

a_EI(x; {x_n, y_n}, θ) = σ(x; {x_n, y_n}, θ) [γ(x) Φ(γ(x)) + N(γ(x); 0, 1)]    (6)

where N(·) is the probability density function of the standard normal random variable, and γ(x) and Φ(·) are as defined in connection with Equation 5.
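Equation 6 can likewise be computed in closed form from the predicted mean and standard deviation. A minimal sketch follows, again assuming the minimization convention; the inputs are illustrative.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Equation 6: a_EI = sigma * (gamma * Phi(gamma) + N(gamma; 0, 1))."""
    gamma = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))   # standard normal CDF
    phi = math.exp(-0.5 * gamma * gamma) / math.sqrt(2.0 * math.pi)  # its PDF
    return sigma * (gamma * Phi + phi)

# At equal predicted mean, a more uncertain candidate has higher expected
# improvement, which is how this acquisition rewards exploration.
ei_narrow = expected_improvement(0.0, 1.0, 0.0)
ei_wide = expected_improvement(0.0, 2.0, 0.0)
```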
[00105] Another non-limiting example of an acquisition utility function that depends on one or more parameters of the probabilistic model is the regret minimization acquisition function (sometimes called the lower confidence bound acquisition function). When the probabilistic model of the objective function comprises a Gaussian process, the regret minimization acquisition function can be expressed according to:

a_LCB(x; {x_n, y_n}, θ) = μ(x; {x_n, y_n}, θ) − κ σ(x; {x_n, y_n}, θ)    (7)

where κ is a tunable parameter that balances local and global exploration.
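Equation 7 needs only the model's predicted mean and standard deviation at each candidate. A minimal sketch, assuming the convention that the objective is minimized so the candidate with the lowest bound is preferred; the (mean, standard deviation) pairs are invented for the example.

```python
def lower_confidence_bound(mu, sigma, kappa=2.0):
    """Equation 7: mu(x) - kappa * sigma(x); kappa trades off exploration."""
    return mu - kappa * sigma

# With a larger kappa, candidates with high predictive uncertainty become more
# attractive because their bound drops further, favoring global exploration.
candidates = [(0.0, 0.1), (0.2, 1.5)]  # illustrative (mu, sigma) pairs
chosen = min(candidates, key=lambda ms: lower_confidence_bound(*ms))
```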
[00106] Another non-limiting example of an acquisition utility function is the entropy search acquisition utility function. The entropy search acquisition utility function aims at selecting the next point at which to evaluate the objective function so as to decrease the uncertainty about the location of the minimum of the objective function (or, equivalently, about the location of the maximum of the objective function multiplied by negative one). To this end, the next point at which to evaluate the objective function is selected by iteratively evaluating points that will decrease the entropy of the probability distribution over the minimum of the objective function. The entropy search acquisition utility function can be expressed as follows. Given a set of C points X, the probability that a point x̂ ∈ X has the minimum objective function value can be expressed according to:

Pr(min at x̂ | f) = ∏_{x̂' ∈ X, x̂' ≠ x̂} h(f(x̂') − f(x̂))    (8)

where f is the vector of values of the objective function at the points X, h(·) is the Heaviside step function, p(f | {x_n, y_n}) is the posterior probability of the values in the vector f given the past evaluations of the objective function, and p(y | f, x) is the probability that the objective function assumes the value y at the point x according to the probabilistic model of the objective function. The KL entropy search acquisition function a_KL can then be written as follows:

a_KL(x) = ∫∫ [H(P_min) − H(P_min^{x,y})] p(y | f, x) p(f | {x_n, y_n}) dy df    (9)

where P_min^{x,y} indicates that the imagined observation {x, y} has been added to the set of observations, H(P) represents the entropy of P, and P_min represents the vector of probabilities {Pr(min at x̂_i | f); i = 1, ..., C} defined by Equation 8.
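The product of Heaviside factors in Equation 8 simply says that a candidate attains the minimum exactly when every other candidate's value exceeds it, so P_min can be approximated by sampling vectors f from the posterior and counting how often each candidate attains the minimum. The sketch below is an illustration, not the patent's implementation: it assumes independent normal marginals at the candidates purely for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def approximate_p_min(means, stds, n_samples=20000):
    """Monte Carlo estimate of Pr(candidate i has the minimum value):
    sample objective-value vectors f, then count argmin frequencies."""
    samples = rng.normal(means, stds, size=(n_samples, len(means)))
    winners = np.argmin(samples, axis=1)
    counts = np.bincount(winners, minlength=len(means))
    return counts / n_samples

# Three candidate points; the first has the lowest predicted mean, so most of
# the probability mass of P_min concentrates on it.
p_min = approximate_p_min(np.array([-1.0, 0.0, 0.5]), np.array([0.3, 0.3, 0.3]))
```

Entropy search then scores a candidate x by how much adding an imagined observation at x is expected to reduce the entropy of this distribution.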
[00107] Each of the examples described above of an acquisition utility function depends on the parameters θ of the probabilistic model. As discussed above, the inventors have recognized that performing Bayesian optimization (for example, to identify hyperparameter values for a machine learning system) using an acquisition utility function that depends on the parameters of the probabilistic model can lead to poor overall performance. For example, a probabilistic model that comprises a d-dimensional Gaussian process (for example, used to model a d-dimensional objective function, for example, from d hyperparameter values to the respective performance of the machine learning system) can be associated with d + 3 parameters, including d length scales, the covariance amplitude, the observation noise variance, and a constant mean. In practice, the values of the probabilistic model parameters θ are set using various procedures, but the overall optimization performance is sensitive to how the parameters are set.
[00108] Thus, in some embodiments, an integrated acquisition utility function is used, which may be less sensitive to the parameters of the probabilistic model of the objective function.
[00109] In some embodiments, an integrated acquisition utility function can be obtained by selecting an initial acquisition utility function that depends on the parameters of the probabilistic model (for example, any of the acquisition utility functions described above can be used as the initial acquisition utility function) and calculating the integrated acquisition utility function by integrating out (marginalizing) the effect of one or more of the parameters on the initial acquisition utility function. For example, the integrated acquisition utility function can be calculated as a weighted average (for example, a weighted integral) of instances of the initial acquisition utility function, with each instance of the initial acquisition utility function corresponding to particular parameter values of the probabilistic model, and each weight corresponding to the probability of those particular parameter values given the previously obtained evaluations of the objective function.
[00110] For example, an integrated acquisition utility function ã(x; {x_n, y_n}) can be calculated by selecting an initial acquisition utility function a(x; {x_n, y_n}; θ) that depends on the probabilistic model parameters θ, and calculating ã(x; {x_n, y_n}) by integrating (averaging) over the parameters θ in proportion to the posterior probability of θ according to:

ã(x; {x_n, y_n}) = ∫ a(x; {x_n, y_n}; θ) p(θ | {x_n, y_n}) dθ    (10)

where the weight p(θ | {x_n, y_n}) represents the posterior probability of the parameters θ of the probabilistic model given the N evaluations at the points {x_n; 1 ≤ n ≤ N} and the results of those evaluations {y_n; 1 ≤ n ≤ N}.
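Equation 10 can be approximated by averaging the initial acquisition utility function over samples of θ drawn from its posterior. The sketch below is a minimal illustration, not the patent's implementation: expected improvement plays the role of the initial acquisition function, equally weighted posterior samples stand in for p(θ | {x_n, y_n}), and the per-sample (mean, standard deviation) predictions are invented for the example.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Expected improvement for a minimization problem (Equation 6)."""
    gamma = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))
    phi = math.exp(-0.5 * gamma * gamma) / math.sqrt(2.0 * math.pi)
    return sigma * (gamma * Phi + phi)

def integrated_acquisition(predictions_per_theta, f_best):
    """Equation 10 by Monte Carlo: average the per-sample acquisition values.
    Each entry is the model's (mean, std) at one candidate under one posterior
    sample of theta; equal weights stand in for the posterior weights."""
    values = [expected_improvement(mu, sigma, f_best)
              for mu, sigma in predictions_per_theta]
    return sum(values) / len(values)

# Three posterior samples of theta giving different predictions at the same
# candidate point (illustrative numbers).
preds = [(0.1, 0.4), (-0.2, 0.9), (0.0, 0.6)]
a_int = integrated_acquisition(preds, f_best=0.0)
```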
[00111] The calculation of an integrated acquisition utility function is further illustrated in Figures 3A and 3B. Figure 3A illustrates three instances of an initial acquisition utility function calculated for three different sets of parameter values of the underlying probabilistic model. Each instance was calculated based on the same set of objective function evaluations. Figure 3B illustrates the integrated acquisition utility function obtained by taking a weighted average of the three instances of the initial acquisition utility function shown in Figure 3A. In the average, the weight corresponding to a particular instance of the initial acquisition function corresponds to the probability of the probabilistic model parameter values used to generate that instance of the initial acquisition function.
[00112] As can be seen from the discussion above, an integrated acquisition utility function does not depend on the values of the probabilistic model parameters θ (although it still depends on previous evaluations of the objective function). As a result, the integrated acquisition utility function is not sensitive to the parameter values of the probabilistic model, which the inventors have observed to improve robustness and performance relative to conventional Bayesian optimization techniques.
[00113] In some embodiments, the integrated acquisition utility function can be calculated in closed form. However, in embodiments in which the integrated acquisition utility function cannot be obtained in closed form, the integrated acquisition utility function can be estimated using numerical techniques. For example, in some embodiments, Monte Carlo simulation techniques can be used to approximate the integrated acquisition utility function and/or to find a point (or an approximation to the point) at which the integrated acquisition utility function attains its maximum. Any suitable Monte Carlo simulation techniques can be employed, including, but not limited to, rejection sampling techniques, adaptive rejection sampling techniques, importance sampling techniques, adaptive importance sampling techniques, Markov chain Monte Carlo techniques (for example, slice sampling, Gibbs sampling, Metropolis sampling, Metropolis sampling within Gibbs sampling, exact sampling, simulated annealing, parallel tempering, annealed importance sampling, population Monte Carlo, etc.), and sequential Monte Carlo techniques (for example, particle filters).
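Of the Markov chain Monte Carlo techniques listed, slice sampling is a natural fit for drawing the parameter samples θ used in the averages above, since it needs only the log posterior up to a constant. Below is a minimal univariate slice sampler (stepping out, then shrinking) as a sketch; the standard normal target standing in for a hyperparameter posterior is an assumption of the example.

```python
import math
import random

random.seed(0)

def slice_sample(log_density, x0, n_samples, width=1.0):
    """Minimal univariate slice sampler: draw a height under the density,
    step out to bracket the slice, then shrink until a point lands inside."""
    samples = []
    x = x0
    for _ in range(n_samples):
        log_y = log_density(x) + math.log(random.random())  # slice height
        left = x - width * random.random()                  # randomly placed
        right = left + width                                # initial interval
        while log_density(left) > log_y:                    # step out left
            left -= width
        while log_density(right) > log_y:                   # step out right
            right += width
        while True:                                         # shrinkage
            candidate = random.uniform(left, right)
            if log_density(candidate) > log_y:
                x = candidate
                break
            if candidate < x:
                left = candidate
            else:
                right = candidate
        samples.append(x)
    return samples

# Sampling model hyperparameters would use the log posterior of theta; a
# standard normal log-density stands in for it here.
draws = slice_sample(lambda t: -0.5 * t * t, x0=0.0, n_samples=5000)
```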
[00114] Figure 4 is a flowchart of an illustrative process 400 for performing optimization of an objective function at least in part by using an integrated acquisition utility function and a probabilistic model of the objective function, in accordance with some embodiments of the technology described in this document. That is, process 400 can be used to identify an extreme point (for example, a local minimum, local maximum, global minimum, global maximum, etc.) of the objective function using the techniques described in this document. Process 400 can be performed using any suitable computing device(s) comprising one or more computer hardware processors, as aspects of the technology described in this document are not limited in this respect.
[00115] In some embodiments, process 400 can be applied to identify (for example, locate or approximate the locations of) one or more extreme points of an objective function that relates hyperparameter values of a machine learning system to respective values that provide a measure of performance of the machine learning system. Process 400 can be used to set hyperparameter values for any of the machine learning systems described in this document and/or any other suitable machine learning systems. Additionally or alternatively, process 400 can be applied to identify (for example, locate or approximate the locations of) one or more extreme points of an objective function arising in any other suitable optimization problem, examples of which have been provided.
[00116] Process 400 starts at action 402, in which a probabilistic model of the objective function is initialized. In some embodiments, the probabilistic model of the objective function may comprise a Gaussian process. In some embodiments, the probabilistic model of the objective function may comprise a neural network. In some embodiments, the probabilistic model of the objective function may comprise an adaptive basis function regression model (linear or non-linear). In any case, it should be noted that any other type of probabilistic model of the objective function can be used, as aspects of the technology described in this document are not limited to any particular type of probabilistic model of the objective function.
[00117] The probabilistic model of the objective function can be initialized by setting values for one or more (for example, all) of the parameters of the probabilistic model. The parameter(s) can be set to any suitable values, which in some instances can be based on any prior information available about the objective function, if any. The parameter values can be stored in memory or on any other suitable type of non-transitory computer-readable medium. In some embodiments, the initial values of the parameters can be initialized based at least in part on information obtained from previously obtained evaluations of another objective function related, in some way, to the objective function. This is discussed in more detail below in connection with multi-task optimization techniques.
[00118] Next, process 400 proceeds to action 404, in which a point at which to evaluate the objective function is identified. For example, when the objective function relates hyperparameter values of a machine learning system to its performance, a set of hyperparameter values for which to evaluate the performance of the machine learning system can be identified in action 404. The identification can be performed at least in part using an acquisition utility function and the probabilistic model of the objective function. In some embodiments, an acquisition utility function that depends on the parameters of the probabilistic model can be used in action 404, for example, a probability of improvement acquisition utility function, an expected improvement acquisition utility function, a regret minimization acquisition utility function, or an entropy-based acquisition utility function. However, in other embodiments, an integrated acquisition utility function can be used in action 404.
[00119] As described above, the integrated acquisition utility function can be obtained by selecting an initial acquisition utility function that depends on one or more parameters of the probabilistic model (for example, a probability of improvement acquisition utility function, an expected improvement acquisition utility function, a regret minimization acquisition utility function, an entropy-based acquisition utility function, etc.), and calculating the integrated acquisition utility function by integrating the initial acquisition utility function with respect to one or more of the probabilistic model parameters (for example, as indicated above in Equation 10).
[00120] In some embodiments, the point at which to evaluate the objective function can be identified as the point (or an approximation to the point) at which the acquisition utility function attains its maximum value. In some embodiments, the point at which the acquisition utility function attains its maximum can be identified exactly (for example, when the acquisition utility function is available in closed form). In some embodiments, however, the point at which the acquisition utility function attains its maximum value may not be identifiable exactly (for example, because the acquisition utility function is not available in closed form), in which case the point at which the acquisition utility function attains its maximum value can be identified or approximated using numerical techniques. For example, in some embodiments, an integrated acquisition utility function may not be available in closed form, and Monte Carlo techniques can be used to identify or approximate the point at which the integrated acquisition utility function attains its maximum value.
[00121] In some embodiments, Markov chain Monte Carlo methods can be used to identify or approximate the point at which the integrated acquisition utility function attains its maximum value. For example, the integrated acquisition utility function can be defined according to the integral in Equation 10 above, which can be approximated using Markov chain Monte Carlo techniques (and/or any other Monte Carlo procedure). In some embodiments, the full integral can be approximated by generating samples of probabilistic model parameter values (in proportion to their posterior probability given any previously obtained evaluations of the objective function), evaluating the initial acquisition utility function at the generated samples, and using the resulting evaluations to approximate the integrated acquisition utility function and/or to identify or approximate a point at which the integrated acquisition utility function attains its maximum value. Additional details on how to identify or approximate a maximum value of the integrated acquisition utility function are provided below.
[00122] It should be noted that the point at which to evaluate the objective function is not limited to being a point (or an approximation to a point) at which the acquisition utility function attains its maximum, and can be any other suitable point identified using the acquisition utility function (for example, a local maximum of the acquisition utility function, a local or global minimum of the acquisition utility function, etc.).
[00123] After the point at which the objective function is evaluated is identified in action 404, process 400 proceeds to action 406, in which the objective function is evaluated at the identified point. For example, when the objective function relates hyperparameter values of a machine learning system to its performance, the performance of the machine learning system configured with the hyperparameters identified in action 404 can be evaluated in action 406.
[00124] After the objective function is evaluated, in action 406, at the point identified in action 404, process 400 proceeds to action 408, in which the probabilistic model of the objective function is updated based on the results of the evaluation. The probabilistic model of the objective function can be updated in any of numerous ways based on the results of the new evaluation obtained in action 406. As a non-limiting example, updating the probabilistic model of the objective function may comprise updating (for example, re-estimating) one or more parameters of the probabilistic model based on the results of the evaluation performed in action 406. As another non-limiting example, updating the probabilistic model of the objective function may comprise updating the covariance kernel of the probabilistic model (for example, when the probabilistic model comprises a Gaussian process, the covariance kernel of the Gaussian process can be updated based on the results of the new evaluation). As another non-limiting example, updating the probabilistic model of the objective function may comprise computing an updated estimate of the objective function using the probabilistic model (for example, by calculating the predicted mean of the probabilistic model based on any previously obtained evaluations of the objective function and the results of the evaluation of the objective function in action 406). As another non-limiting example, updating the probabilistic model of the objective function may comprise calculating an updated measure of uncertainty associated with the updated estimate of the objective function (for example, by calculating the predicted covariance of the probabilistic model based on any previously obtained evaluations of the objective function and the results of the evaluation of the objective function in action 406). As another non-limiting example, updating the probabilistic model may simply comprise storing the evaluation results, as those evaluation results can be used subsequently when performing computations using the probabilistic model of the objective function (for example, calculating an estimate of the objective function, updating one or more parameters of the probabilistic model, etc.).
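The qualitative effect of the update in action 408, namely that the estimate moves toward the new evaluation and the associated uncertainty shrinks, can be seen even in the simplest conjugate model. The sketch below is a stand-in illustration, not the Gaussian process update itself; the prior and noise values are arbitrary.

```python
def update_gaussian_belief(prior_mean, prior_var, observed_value, noise_var):
    """Conjugate update of a Gaussian belief about the objective value at one
    point given a single noisy evaluation: precisions add, and the posterior
    mean is a precision-weighted average of prior mean and observation."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + observed_value / noise_var)
    return post_mean, post_var

# Before the evaluation the model is uncertain; after it, the variance shrinks
# and the mean moves toward the observed value.
mean_1, var_1 = update_gaussian_belief(0.0, 1.0, observed_value=2.0, noise_var=1.0)
```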
[00125] After the probabilistic model of the objective function is updated in action 408, process 400 proceeds to decision block 410, in which it is determined whether the objective function should be evaluated at another point. This determination can be made in any suitable way. As a non-limiting example, process 400 may involve performing no more than a threshold number of evaluations of the objective function, and when that number of evaluations has been performed, it can be determined that the objective function should not be evaluated again (for example, due to the time and/or computational cost of carrying out such an evaluation). On the other hand, when fewer than the threshold number of evaluations have been carried out, it can be determined that the objective function should be evaluated again. As another non-limiting example, the determination of whether to evaluate the objective function again can be made based on one or more previously obtained objective function values. For example, if the optimization involves finding an extreme point (for example, a maximum) of the objective function and the objective function's values have not increased by more than a threshold value over previous iterations (for example, over a threshold number of evaluations performed), a determination can be made not to evaluate the objective function again (for example, because further evaluations of the objective function are unlikely to identify points at which the objective function takes on values greater than the values at the points at which the objective function has already been evaluated). In any case, the determination of whether to evaluate the objective function again can be made in any of numerous appropriate ways, as the aspects of the technology described herein are not limited in this respect.
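A minimal sketch of such a stopping rule, combining an evaluation budget with an improvement threshold (all names and default values are illustrative assumptions, not part of the described process):

```python
def should_stop(history, max_evals=50, patience=5, min_improvement=1e-3):
    """Decide whether to stop evaluating the objective function.

    `history` is the list of objective values observed so far (maximization).
    Stops when the evaluation budget is exhausted, or when the best value
    has not improved by more than `min_improvement` over the last
    `patience` evaluations.
    """
    if len(history) >= max_evals:
        return True
    if len(history) > patience:
        recent_best = max(history[-patience:])
        earlier_best = max(history[:-patience])
        if recent_best - earlier_best <= min_improvement:
            return True
    return False
```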
[00126] When it is determined, in decision block 410, that the objective function should be evaluated again, process 400 returns, via the YES branch, to action 404 and actions 404 to 408 are repeated. On the other hand, when it is determined in decision block 410 that the objective function should not be evaluated again, process 400 proceeds to action 412, where an extreme value of the objective function can be identified based on one or more values of the objective function obtained during process 400.
[00127] In action 412, an extreme value of the objective function can be identified in any suitable way from the value(s) obtained for the objective function. As a non-limiting example, the extreme value (for example, a maximum) can be selected to be one of the values obtained during evaluation (for example, by taking a maximum of the objective function values obtained during process 400). As another non-limiting example, the extreme value (for example, a maximum) can be obtained using a functional form fitted to the values of the objective function obtained during process 400 (for example, a kernel density estimate of the objective function, a maximum of the estimate of the objective function obtained based on the probabilistic model, etc.). After the extreme value of the objective function is identified in action 412, process 400 completes.
[00128] As discussed above, in some embodiments, Monte Carlo methods can be used to identify and/or approximate the point at which the integrated acquisition utility function reaches its maximum value. A non-limiting example of how such calculations can be performed is detailed below.
[00129] Let f(x) denote the objective function and let the set X denote the set of points at which the objective function can be evaluated. Assume that the objective function has been evaluated N times to obtain {x_n, y_n; 1 ≤ n ≤ N}, where each x_n represents a point at which the objective function was evaluated and y_n represents the corresponding value of the objective function (that is, y_n = f(x_n)). Let p(·) denote the probabilistic model of the objective function.
[00130] The integrated acquisition utility function can be given according to:
a(x; {x_n, y_n; 1 ≤ n ≤ N}) = ∫ a(x; {x_n, y_n; 1 ≤ n ≤ N}, θ) p(θ | {x_n, y_n; 1 ≤ n ≤ N}) dθ   (12)

where

a(x; {x_n, y_n; 1 ≤ n ≤ N}, θ) = ∫ ψ(y, y*) p(y | x; {x_n, y_n; 1 ≤ n ≤ N}, θ) dy   (13)

in which p(y | x; {x_n, y_n; 1 ≤ n ≤ N}, θ) is the predictive marginal density obtained from the probabilistic model of the objective function given the parameters θ of the probabilistic model and the previously completed evaluations, and ψ(y, y*) corresponds to a heuristic of choice, with y* denoting the best objective function value observed so far (for example, y* = min_n y_n in a minimization problem).

For example, the probability of improvement and expected improvement heuristics can be represented, respectively, according to:

ψ_PI(y, y*) = 𝟙[y < y*]   (14)

ψ_EI(y, y*) = max(y* − y, 0)   (15)

where 𝟙[·] denotes the indicator function.
[00131] As discussed above, in some instances, the integrated acquisition utility function of Equation 12 may not be obtainable in closed form (for example, it may not be possible to compute the integral with respect to the parameters θ in closed form). In such cases, the integrated acquisition utility function of Equation 12 can be approximated by the following numerical procedure.
[00132] Initially, for each 1 ≤ j ≤ J, draw a sample θ^(j) according to:

θ^(j) ~ p(θ | {x_n, y_n; 1 ≤ n ≤ N})   (16)

where, by Bayes' rule,

p(θ | {x_n, y_n; 1 ≤ n ≤ N}) ∝ p({y_n} | {x_n}, θ) p(θ)   (17)
[00133] Any suitable Monte Carlo technique can be used to draw samples according to Equation 16, including, but not limited to, inversion sampling, importance sampling, rejection sampling and Markov chain Monte Carlo techniques (examples of which have been provided).
[00134] Given J samples {θ^(j); 1 ≤ j ≤ J} drawn according to Equation 16, the integrated acquisition utility function can be approximated according to:

a(x; {x_n, y_n; 1 ≤ n ≤ N}) ≈ (1/J) Σ_{j=1}^{J} a(x; {x_n, y_n; 1 ≤ n ≤ N}, θ^(j))   (18)

[00135] The approximation of the integrated acquisition utility function computed through Equation 18 can be used to identify a point that is (or is an approximation of) a point at which the integrated acquisition utility function reaches its maximum value. The objective function can then be evaluated at the identified point.
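The averaging step of this procedure can be sketched as follows for the expected improvement heuristic, assuming (a simplification for illustration) that the Gaussian predictive mean and standard deviation at the candidate point have already been computed under each hyperparameter sample θ^(j):

```python
import math

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for minimization under a Gaussian predictive N(mu, sigma^2)."""
    if sigma <= 0:
        return max(y_best - mu, 0.0)
    gamma = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))   # standard normal CDF
    phi = math.exp(-0.5 * gamma**2) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return sigma * (gamma * Phi + phi)

def integrated_acquisition(predictives, y_best):
    """Monte Carlo approximation in the spirit of Equation 18: average the
    acquisition over J posterior samples of the model hyperparameters.

    `predictives` holds (mu_j, sigma_j), the predictive at the candidate
    point under the j-th hyperparameter sample (assumed precomputed).
    """
    return sum(expected_improvement(mu, s, y_best)
               for mu, s in predictives) / len(predictives)
```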
[00136] As discussed above, the inventors recognized that conventional Bayesian optimization techniques use probabilistic models that are not suitable for accurately modeling some types of objective functions. For example, conventional Bayesian optimization techniques use stationary Gaussian processes to model objective functions (for example, the covariance between two outputs is invariant under translations of the input space), but a stationary Gaussian process may not be suitable for modeling a non-stationary objective function. For example, when the objective function relates a machine learning system's hyperparameter values to its performance, a Gaussian process that has a short length scale may be more appropriate for modeling the objective function at points close to its maximum value, and a Gaussian process that has a longer length scale may be more appropriate for modeling the objective function at points more distant from its maximum value (for example, because a machine learning system may perform equally poorly for all bad hyperparameter values, while its performance may be sensitive to small adjustments in regimes of good hyperparameter values). In contrast, a stationary Gaussian process model would represent the objective function using the same length scale for all points at which the objective function is defined.
[00137] Accordingly, some embodiments are directed to performing Bayesian optimization using a probabilistic model adapted to model both stationary and non-stationary objective functions more reliably. In some embodiments, the probabilistic model of the objective function can be specified based at least in part on a non-linear one-to-one mapping (sometimes called a "warping") of elements in the domain of the objective function, to account for non-stationarity of the objective function. For example, in embodiments in which the objective function relates hyperparameter values of a machine learning system to its performance, the probabilistic model can be specified based at least in part on a non-linear warping of the hyperparameter values to account for non-stationarity of the objective function.
[00138] In some embodiments, the probabilistic model of the objective function that accounts for non-stationarity in the objective function can be specified as a composition of a non-linear one-to-one mapping with a stationary probabilistic model. For example, the probabilistic model of the objective function that accounts for non-stationarity in the objective function can be specified as a composition of a non-linear one-to-one mapping with a stationary Gaussian process. The covariance kernel of the Gaussian process can then be specified at least in part using the non-linear mapping.
[00139] In embodiments in which the probabilistic model of the objective function is specified as a composition of a non-linear one-to-one mapping and a stationary probabilistic model, the composition can be expressed as follows. Let g(x; φ) denote a non-linear one-to-one mapping parameterized by one or more parameters φ, and let p(z; θ) denote a stationary probabilistic model (for example, a stationary Gaussian process) parameterized by parameters θ (the points x and the points z can be in the same domain or in different domains depending on the choice of the non-linear one-to-one mapping g(x; φ)). Then, the composition of the non-linear one-to-one mapping and the stationary probabilistic model can be used to obtain the probabilistic model given by p(z = g(x; φ); θ), or p(g(x; φ); θ) for short. Using the non-linear mapping g(x; φ) to transform the input z of a stationary probabilistic model, such as a stationary Gaussian process, allows the resulting probabilistic model to account for non-stationary effects in the objective function.
[00140] In some embodiments, the objective function can be a mapping of elements from a first domain to a range, and the non-linear one-to-one mapping g(x; φ): X → Z can be a mapping of elements in the first domain (for example, the points x in X) to elements in a second domain (for example, the points z = g(x; φ) in Z). For example, when the objective function relates hyperparameter values of a machine learning system to its performance, the first domain may comprise hyperparameter values or appropriately normalized hyperparameter values (for example, hyperparameter values normalized to lie in a unit hypercube, a unit sphere, a hypercube of a specific diameter, a sphere of a specific diameter, etc.), the range can comprise values indicative of performance of the machine learning system, and the second domain can comprise values obtained by applying the non-linear one-to-one mapping to hyperparameter values in the first domain. That is, the second domain is the range of the non-linear one-to-one mapping. The first domain can be the same domain as the second domain (for example, the first domain can be a unit hypercube and the second domain can be a unit hypercube; X = Z using the notation above), although aspects of the technology described herein are not limited in this respect, as the first and second domains may be different (for example, X ≠ Z using the notation above) in some embodiments.
[00141] In some embodiments, the non-linear one-to-one mapping may comprise a cumulative distribution function of a random variable. In some embodiments, the non-linear one-to-one mapping may comprise a cumulative distribution function of a Beta random variable. For example, the non-linear one-to-one mapping of points in the d-dimensional space in which an objective function is defined (for example, the space of hyperparameter values of a machine learning system that has d hyperparameters) can be specified coordinate-wise as follows:

w_d(x_d) = BetaCDF(x_d; α_d, β_d) = ∫₀^{x_d} u^{α_d−1} (1 − u)^{β_d−1} / B(α_d, β_d) du   (19)

where x_d is the value of x in its d-th coordinate, BetaCDF refers to the cumulative distribution function (CDF) of the Beta random variable, and B(α_d, β_d) is the Beta CDF normalization constant. The Beta CDF is parameterized by positive ("shape") parameters α_d and β_d. It should be appreciated that the non-linear one-to-one mapping is not limited to comprising the cumulative distribution function of a Beta random variable and can, instead, comprise the cumulative distribution function of a Kumaraswamy random variable, a Gamma random variable, a Poisson random variable, a Binomial random variable, a Gaussian random variable, or any other suitable random variable. It should also be appreciated that the non-linear one-to-one mapping is not limited to being a cumulative distribution function and, for example, can be any suitably monotonically increasing or decreasing function, or any suitable bijective function (for example, any suitable bijective function that has the d-dimensional hypercube as its domain and range, for an integer d ≥ 1).
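A small sketch of such a coordinate-wise warping; the Kumaraswamy CDF is used here instead of the Beta CDF of Equation 19 because it has a closed form, which the text above notes is an equally permissible choice:

```python
def kumaraswamy_cdf(x, a, b):
    """CDF of the Kumaraswamy distribution on [0, 1]: F(x) = 1 - (1 - x^a)^b.

    Like the Beta CDF, this is a monotone one-to-one mapping of the unit
    interval onto itself for positive shape parameters a and b.
    """
    return 1.0 - (1.0 - x**a) ** b

def warp(point, shapes):
    """Warp each coordinate x_d with its own shape parameters (a_d, b_d),
    mirroring the coordinate-wise warping w_d(x_d) of Equation 19."""
    return [kumaraswamy_cdf(x, a, b) for x, (a, b) in zip(point, shapes)]
```

With shape parameters (1, 1) the warping is the identity; other values bend the input space, compressing some regions and stretching others, which is how a stationary model over the warped inputs can capture non-stationary behavior.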
[00142] In some embodiments, the non-linear one-to-one mapping may comprise a combination (for example, a composition or any other suitable type of combination) of two or more non-linear one-to-one mappings. For example, the non-linear one-to-one mapping may comprise a combination of two or more cumulative distribution functions. As a non-limiting example, the non-linear one-to-one mapping may comprise a combination of the cumulative distribution function of the Beta distribution and the cumulative distribution function of the Kumaraswamy distribution.
[00143] Non-limiting examples illustrating how a non-linear one-to-one mapping warps a non-stationary objective function are shown in Figures 5A to 5F. As an example, the one-dimensional non-stationary periodic objective function shown in Figure 5A can be transformed by applying the bijective non-linear warping shown in Figure 5B to obtain the stationary periodic objective function shown in Figure 5C. As another example, the one-dimensional non-stationary exponential objective function shown in Figure 5D can be transformed by applying the bijective non-linear warping shown in Figure 5E to obtain the stationary objective function shown in Figure 5F. It should be noted that these two examples are illustrative and not limiting, and that the objective functions to which the techniques described herein can be applied are not limited to being one-dimensional objective functions, let alone the two illustrative one-dimensional objective functions shown in Figures 5A to 5F.
[00144] The inventors recognized that there are many different non-linear warpings that can be used to specify a probabilistic model of an objective function. Since the non-stationary nature (if any) of the objective function may not be known in advance, a technique is needed for selecting an appropriate non-linear warping for use in specifying the probabilistic model. Thus, in some embodiments, a non-linear warping can be inferred based, at least in part, on one or more evaluations of the objective function (for example, the maximum a posteriori estimate of the non-linear warping parameters given the results of all evaluations can be used to determine the non-linear warping), and the probabilistic model of the objective function can be specified using the non-linear warping.
[00145] In some embodiments, the probabilistic model of the objective function can be specified as a function of a family of non-linear warpings, the family of warpings parameterized by one or multiple parameters, in which the parameter(s) can be inferred based on one or more evaluations of the objective function. For example, the probabilistic model of the objective function can be specified using a family of cumulative distribution functions of the Beta random variable, parameterized by two positive shape parameters α and β. Each of the shape parameters α and β can be assumed, a priori (that is, before any objective function evaluations are carried out), to be distributed (for example, independently of each other) according to a log-normal distribution. For example, in some embodiments, the shape parameters α_d and β_d of a non-linear warping (for example, for warping the d-th coordinate of points in the space in which the objective function is defined) can be assumed to be distributed according to:

log(α_d) ~ N(μ_α, σ_α)   log(β_d) ~ N(μ_β, σ_β)   (20)
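Drawing warping shape parameters from the log-normal priors of Equation 20 can be sketched as follows (the prior location and scale values are illustrative assumptions):

```python
import math
import random

def sample_warping_shapes(d, mu=0.0, sigma=0.5, rng=None):
    """Draw shape parameters (alpha_d, beta_d) for each of d coordinates
    from independent log-normal priors, as in Equation 20:
    log(alpha_d) ~ N(mu, sigma^2), log(beta_d) ~ N(mu, sigma^2).
    """
    rng = rng or random.Random(0)
    shapes = []
    for _ in range(d):
        alpha = math.exp(rng.gauss(mu, sigma))  # exponentiating keeps shapes positive
        beta = math.exp(rng.gauss(mu, sigma))
        shapes.append((alpha, beta))
    return shapes
```

Exponentiating a Gaussian draw guarantees the positivity that the Beta (or Kumaraswamy) CDF shape parameters require, which is why a log-normal prior is a natural choice here.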
[00146] Thus, in some embodiments, the probabilistic model of an objective function can be specified using a family of non-linear warpings (for example, a family of non-linear warpings specified by placing prior distributions on the parameters of a cumulative distribution function of a random variable, such as the Beta random variable). Such a probabilistic model can be used to identify (for example, the locations of, or points in proximity to the locations of) one or more extreme points of an objective function that relates hyperparameter values of a machine learning system to respective values that provide a measure of the machine learning system's performance, and/or of any objective function arising in any other suitable optimization problem, examples of which have been provided. This can be done in any appropriate way and, in some embodiments, it can be done by integrating out (marginalizing over) the parameters of the family of non-linear warpings, by treating these parameters as parameters of the probabilistic model to be integrated out, as described with reference to process 400 above.
[00147] Thus, in some embodiments, optimization of an objective function using, at least in part, a probabilistic model of the objective function that depends on a non-linear one-to-one mapping can be performed according to process 400, with appropriate modifications (for example, to step 404 of process 400) to account for the dependence of the probabilistic model on the non-linear mapping. In particular, the parameters of the family of non-linear warpings (for example, the shape parameters α and β of a Beta CDF) are treated as parameters of the probabilistic model, and the integrated acquisition utility function used to identify points at which to evaluate the objective function is obtained by integrating over at least these parameters of the probabilistic model. More generally, the probabilistic model can comprise two sets of parameters θ and φ, in which the parameters φ are the parameters of the family of non-linear warpings and θ are all other parameters of the probabilistic model, and the integrated acquisition utility function can be obtained by integrating an initial acquisition utility function with respect to θ, φ, or both θ and φ.
[00148] As discussed with reference to process 400, in some embodiments, numerical techniques can be used to identify and/or approximate the point at which the integrated acquisition utility function reaches its maximum value. Numerical techniques (for example, rejection sampling, importance sampling, Markov chain Monte Carlo, etc.) may also be necessary for this purpose when the probabilistic model depends on the parameters of the non-linear one-to-one mapping. A non-limiting example of how Monte Carlo techniques can be used to identify and/or approximate the point at which the integrated acquisition utility function reaches its maximum value, when the probabilistic model depends on a non-linear mapping, is detailed below.
[00149] Let f(x) denote the objective function and let the set X denote the set of points at which the objective function can be evaluated. Assume that the objective function has been evaluated N times to obtain {(g(x_n; φ), y_n); 1 ≤ n ≤ N}, where each x_n represents a point at which the objective function was evaluated, g(x_n; φ) represents the result of applying the non-linear warping function g, which has parameters φ, to the point x_n, and y_n represents the corresponding value of the objective function (that is, y_n = f(x_n)). Let p(·) denote the probabilistic model of the objective function that depends on a non-linear one-to-one mapping g, where the probabilistic model has parameters θ (one or more parameters of the probabilistic model not including any parameters of the non-linear one-to-one mapping) and φ (one or more parameters of the non-linear one-to-one mapping). The parameters θ and φ are assumed to be independent. The integrated acquisition utility function can be approximated by the following numerical procedure.
[00150] Initially, for each 1 ≤ j ≤ J, a sample (θ^(j), φ^(j)) is drawn according to:

(θ^(j), φ^(j)) ~ p(θ, φ | {g(x_n; φ), y_n; 1 ≤ n ≤ N})   (21)
[00151] Any suitable Monte Carlo technique can be used to draw samples according to Equation 21, including, but not limited to, inversion sampling, importance sampling, rejection sampling and Markov chain Monte Carlo techniques (examples of which have been provided).
[00152] Given J samples {(θ^(j), φ^(j)); 1 ≤ j ≤ J} drawn according to Equation 21, the integrated acquisition utility function can be approximated according to:

a(x; {x_n, y_n; 1 ≤ n ≤ N}) ≈ (1/J) Σ_{j=1}^{J} a(g(x; φ^(j)); {g(x_n; φ^(j)), y_n; 1 ≤ n ≤ N}, θ^(j))   (22)
[00153] The approximation of the integrated acquisition utility function computed through Equation 22 can be used to identify a point that is (or is an approximation of) a point x* at which the integrated acquisition utility function reaches its maximum value. This can be done in any appropriate way. For example, in some embodiments, the integrated acquisition utility function can be approximated according to Equation 22 on a grid of points, and the point in the grid at which the approximation reaches its maximum value can be taken as the point x*. Alternatively, a local search (for example, gradient-based search) can be performed around one or more points on the grid to identify the point x*. After the point x* is identified, the objective function can be evaluated at x*.
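The grid-based identification of x* described above can be sketched as follows; the quadratic surrogate standing in for the acquisition utility function here is purely illustrative:

```python
def maximize_on_grid(acquisition, grid):
    """Approximate x* = argmax of the (integrated) acquisition utility by
    evaluating it at every grid point and keeping the best; a local search
    could then be run around the returned point to refine it."""
    best_x, best_val = None, float("-inf")
    for x in grid:
        val = acquisition(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Illustrative acquisition surrogate peaking at x = 0.37.
grid = [i / 100.0 for i in range(101)]
x_star, _ = maximize_on_grid(lambda x: -(x - 0.37) ** 2, grid)
```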
[00154] As discussed above, conventional Bayesian optimization techniques require choosing the next point at which to evaluate the objective function (for example, identifying the next set of hyperparameter values at which to evaluate the performance of a machine learning system) based on the results of all previous evaluations of the objective function. Each objective function evaluation must be completed before the next point at which to evaluate the objective function is identified. In this way, all objective function evaluations must be performed sequentially (that is, one at a time) when using conventional Bayesian optimization methods.
[00155] In contrast, the technology described in this document can be used to parallelize Bayesian optimization techniques so that multiple evaluations of the objective function can be performed in parallel, which is advantageous when each evaluation of the objective function is computationally expensive to perform, as may be the case when identifying hyperparameter values for machine learning systems that take a long time (for example, days) to train. Parallel evaluations of the objective function can be performed using different computer hardware processors. For example, parallel evaluations of the objective function can be performed using different computer hardware processors integrated on the same substrate (for example, different processor cores) or different computer hardware processors not integrated on the same substrate (for example, different computers, different servers, etc.).
[00156] The inventors recognized that parallelizing conventional Bayesian optimization simply by concurrently evaluating the objective function at different points, all of which are chosen based on the results of previously completed evaluations, is ineffective because selecting points at which to evaluate the objective function in this way does not take into account any information about pending evaluations of the objective function. Thus, in some embodiments, the next point at which to evaluate the objective function is selected based on information about one or more pending evaluations of the objective function as well as one or more previously completed evaluations of the objective function. For example, the next point at which to evaluate the objective function can be selected before the completion of one or more previously started evaluations of the objective function, but the selection can be made based on the respective probabilities of potential results of the pending evaluations of the objective function, so that some information about the pending evaluations (for example, the particular points at which the evaluations are being carried out) is taken into account when selecting the next point at which to evaluate the objective function.
[00157] In some embodiments, the selection of the next point at which to evaluate the objective function based on one or more pending evaluations of the objective function can be performed using an acquisition utility function that depends on the probabilities of potential results of the pending evaluations of the objective function, the probabilities being determined according to the probabilistic model of the objective function. In some embodiments, the selection of the next point at which to evaluate the objective function comprises using an acquisition utility function obtained at least in part by calculating an expected value of an initial acquisition utility function with respect to the potential values of the objective function at the plurality of points. The initial acquisition utility function can be a probability of improvement utility function, an expected improvement utility function, a regret minimization utility function, an entropy-based utility function, and/or any other suitable acquisition utility function.
[00158] Figure 6 is a flowchart of an illustrative process 600 for performing optimization of an objective function at least in part using multiple computer hardware processors, according to some embodiments of the technology described in this document. Process 600 can be used to identify an extreme point (for example, a local minimum, a local maximum, a global minimum, a global maximum, etc.) of the objective function using the techniques described herein. Process 600 can be performed using different computer hardware processors of any suitable type. For example, at least some (for example, all) portions of process 600 can be performed using different computer hardware processors integrated on the same substrate (for example, different processor cores) or computer hardware processors of different computers not integrated on the same substrate.
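A minimal sketch of issuing several objective function evaluations concurrently; the toy objective and the use of a thread pool are illustrative assumptions (separate processes, cores or machines would play the same role for genuinely expensive evaluations):

```python
from concurrent.futures import ThreadPoolExecutor

def objective(x):
    """Stand-in for an expensive evaluation, e.g. training a machine
    learning system with hyperparameter value x and measuring performance."""
    return -(x - 0.5) ** 2

# Evaluate the objective at several points in parallel. Executor.map
# returns the results in the same order as the submitted points.
points = [0.1, 0.4, 0.5, 0.9]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(objective, points))
```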
[00159] In some embodiments, process 600 can be applied to identify (for example, locate or approximate the locations of) one or more extreme points of an objective function that relates hyperparameter values of a machine learning system to respective values that provide a measure of the machine learning system's performance. Process 600 can be used to set hyperparameter values for any of the machine learning systems described herein and/or any other suitable machine learning systems. Additionally or alternatively, process 600 can be applied to identify (for example, locate or approximate the locations of) one or more extreme points of an objective function arising in any other suitable optimization problem, examples of which have been provided.
[00160] Process 600 starts at action 602, in which a probabilistic model of the objective function is initialized. This can be done in any suitable way and, for example, can be done in any of the ways described with reference to action 402 of process 400.
[00161] Next, process 600 proceeds to decision block 604, in which it is determined whether there are any pending evaluations of the objective function (that is, evaluations of the objective function that are pending completion). A pending evaluation can be an evaluation for which the point at which the evaluation is to be carried out has been identified (for example, the set of hyperparameter values at which the performance of a machine learning system is to be evaluated has been identified), but the evaluation of the objective function at the identified point has not started (and therefore has not been completed). A pending evaluation can also be any evaluation of the objective function that has been initiated but has not been completed. The determination of whether there are any pending evaluations of the objective function can be carried out in any appropriate manner, as the aspects of the technology described herein are not limited by how such a determination can be made.
[00162] When it is determined, in decision block 604, that there are no pending evaluations of the objective function, process 600 proceeds to action 605, in which a point at which to evaluate the objective function is identified using a probabilistic model of the objective function and an acquisition utility function. This can be done in any appropriate way and, for example, can be done in any of the ways described with reference to action 404 of process 400. Any suitable acquisition utility function can be used in action 605, including, for example, any of the acquisition utility functions described herein.
[00163] On the other hand, when it is determined in decision block 604 that there are one or more pending evaluations of the objective function, process 600 proceeds to action 606, in which information about the pending evaluation(s) is obtained. Information about the pending evaluation(s) may include information that identifies the point(s) (for example, sets of hyperparameter values) at which the pending evaluation(s) are being (or are to be) carried out. Information about the pending evaluation(s) may also include information about the probabilities of potential results of the pending evaluation(s). Information about the probabilities of potential results of the pending evaluation(s) can be obtained based, at least in part, on the probabilistic model of the objective function.
[00164] Next, process 600 proceeds to action 608, in which one or more new points at which to evaluate the objective function are identified based, at least in part, on the information about the pending evaluations obtained in action 606. Any suitable number of points at which to evaluate the objective function can be identified in action 608. For example, when there are M pending evaluations of the objective function (where M is an integer greater than or equal to 1), M points at which to evaluate the objective function can be identified in action 608. Nevertheless, in some embodiments, fewer than M points can be identified in action 608. In some embodiments, more than M points can be identified in action 608.
[00165] In some embodiments, the point(s) at which to evaluate the objective function are identified based, at least in part, on information that identifies the point(s) at which the pending evaluations are being (or are to be) carried out. In some embodiments, the point(s) at which to evaluate the objective function are further identified based on the probabilities of potential outcomes of the pending objective function evaluations, where the probabilities are determined based, at least in part, on the probabilistic model of the objective function.
[00166] For example, in some embodiments, the point(s) at which to evaluate the objective function can be identified using an acquisition utility function that depends on information about the pending evaluations and on the probabilistic model. The acquisition utility function may depend on the points at which the pending evaluations are being (or are to be) carried out and on the respective probabilities of their results according to the probabilistic model of the objective function (for example, according to the predictive distribution induced by the probabilistic model of the objective function).
[00167] For example, the following acquisition utility function h(x) can be used to identify the point(s) for evaluation as part of action 608:

h(x) = ∫ [ ∫ ψ(y, y*) p(y | x; {x_n, y_n; 1 ≤ n ≤ N}, {x_m, y_m; 1 ≤ m ≤ M}) dy ] p(y_1, …, y_M | {x_m; 1 ≤ m ≤ M}, {x_n, y_n; 1 ≤ n ≤ N}) dy_1 … dy_M   (23)

where the set {x_n, y_n; 1 ≤ n ≤ N} corresponds to the N previously completed evaluations (for which both the points at which the objective function was evaluated and the evaluation results are available), the set {x_m; 1 ≤ m ≤ M} corresponds to the M pending evaluations (for which only the points at which the objective function is being, or is to be, evaluated are available), p(·) is the probabilistic model of the objective function, and ψ(y, y*) corresponds to a heuristic of choice (for example, as described above with reference to Equations 14 and 15). In this way, the acquisition utility function of Equation 23 is calculated as an expected value of an initial acquisition utility function (specified through the heuristic ψ(y, y*)) with respect to potential values of the objective function at the plurality of points {x_m; 1 ≤ m ≤ M}.
[00168] In some embodiments, when multiple points at which to evaluate the objective function are identified in act 608, the points may be identified one at a time, and the acquisition utility function (for example, the acquisition utility function shown in Equation 23) may be updated after each point is identified. For example, after a first point is selected in act 608, a second point may be selected using an acquisition utility function that depends on information identifying the first point.
[00169] In some embodiments, a new point at which to evaluate the objective function may be identified in act 608 as the point (or as an approximation to the point) at which the acquisition utility function attains its maximum value. In some embodiments, the point at which the acquisition utility function attains its maximum may be identified exactly (for example, when the acquisition utility function is available in closed form). In some embodiments, however, the point at which the acquisition utility function attains its maximum value may not be identifiable exactly (for example, because the acquisition utility function is not available in closed form), in which case the point at which the acquisition utility function attains its maximum value may be identified or approximated using numerical techniques.
[00170] For example, in some embodiments, the acquisition utility function of Equation 23 may be approximated by a Monte Carlo estimate according to:
\[
h(x) \approx \frac{1}{J} \sum_{j=1}^{J} \psi\Bigl(y,\ \min\bigl\{\min_{1 \le n \le N} y_n,\ \min_{1 \le m \le M} y_m^{(j)}\bigr\}\Bigr) \qquad (24)
\]
where y_m^(j) is a sample from the M-dimensional predictive distribution induced by the probabilistic model. When the probabilistic model comprises a Gaussian process, the predictive distribution is Gaussian, and y_m^(j) may be generated by sampling the Gaussian distribution with the appropriate parameters. For other probabilistic models, other numerical techniques may be used, including, without limitation, Monte Carlo techniques such as rejection sampling, importance sampling, Markov chain Monte Carlo, etc.
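The Monte Carlo estimate of Equation 24 can be illustrated with a short sketch. The code below is a hypothetical illustration (not the patented implementation): it uses the expected-improvement heuristic as ψ(y, y*), fantasizes outcomes for the M pending points by sampling the predictive distribution, and averages the acquisition value over the J fantasies, each time taking the best value among completed and fantasized observations as the incumbent.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, y_star):
    """psi(y, y*): expected improvement over incumbent y* when the objective's
    value y at the candidate point has predictive distribution N(mu, sigma^2)."""
    z = (y_star - mu) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return (y_star - mu) * cdf + sigma * pdf

def mc_acquisition(mu, sigma, y_completed, pending_samples):
    """Monte Carlo estimate (cf. Equation 24) of the acquisition value at a
    candidate point with predictive mean mu and std sigma, marginalizing over
    pending evaluations.  pending_samples is a (J, M) array whose j-th row is
    one draw y^(j) from the M-dimensional predictive distribution at the M
    pending points."""
    y_best = min(y_completed)
    vals = [expected_improvement(mu, sigma, min(y_best, row.min()))
            for row in pending_samples]
    return float(np.mean(vals))
```

When the model is a Gaussian process, the rows of `pending_samples` can be drawn from the joint Gaussian predictive distribution at the pending points; for other models, rejection sampling, importance sampling, or Markov chain Monte Carlo can supply the draws.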
[00171] It should be noted that the point at which the objective function is evaluated is not limited to being the point (or an approximation to the point) at which the acquisition utility function attains its maximum, and may be any other suitable point identified using the acquisition utility function (for example, a local maximum of the acquisition utility function, a local or global minimum of the acquisition utility function, etc.).
[00172] After one or more point(s) at which to evaluate the objective function are identified in act 608, process 600 proceeds to act 610, in which evaluation of the objective function at the identified point(s) is initiated. This may be done in any suitable way. For example, in some embodiments, when multiple points are identified in act 608, evaluation of the objective function at the identified points may be initiated such that the objective function is evaluated using different computer hardware processors (for example, when first and second points are identified in act 608, the evaluations may be initiated such that the objective function is evaluated at the first point using a first computer hardware processor and at the second point using a second computer hardware processor different from the first computer hardware processor).
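A hedged sketch of initiating evaluations concurrently is shown below. Threads stand in here for the separate hardware processors described above; for CPU-bound objective functions, a process pool would be used to obtain true parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def launch_evaluations(objective, points, max_workers=4):
    """Initiate evaluation of the objective at each identified point so that
    the evaluations run concurrently, and gather results as they complete."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(objective, p): p for p in points}
        return {p: f.result() for f, p in futures.items()}

# Illustrative objective: evaluate x^2 at three identified points concurrently.
results = launch_evaluations(lambda x: x * x, [1.0, 2.0, 3.0])
```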
[00173] Next, process 600 proceeds to decision block 612, in which it is determined whether evaluation of the objective function has completed at any point. This determination may be made in any suitable way. When it is determined that evaluation of the objective function has not completed at any point, process 600 waits for the evaluation to complete at at least one point. On the other hand, when it is determined that evaluation of the objective function has completed at one or more points, process 600 proceeds to act 614, in which the probabilistic model of the objective function is updated based on the results of the completed evaluations. The probabilistic model may be updated in any suitable way and, for example, may be updated in any of the ways described in reference to act 408 of process 400.
[00174] After the probabilistic model of the objective function is updated in act 614, process 600 proceeds to decision block 616, in which it is determined whether the objective function is to be evaluated at another point. This determination may be made in any suitable way and, for example, may be made in any of the ways described in reference to decision block 410 of process 400.
[00175] When it is determined, in decision block 616, that the objective function is to be evaluated again, process 600 returns, via the YES branch, to block 604, and acts/decision blocks 604 through 612 are repeated. On the other hand, when it is determined in decision block 616 that the objective function is not to be evaluated again, process 600 proceeds to act 618, in which an extreme value of the objective function may be identified based on one or more values of the objective function obtained during process 600.
[00176] In act 618, an extreme value of the objective function may be identified in any suitable way based on the obtained value(s) of the objective function and, for example, may be identified in any of the ways described in relation to act 412 of process 400. After the extreme value of the objective function is identified in act 618, process 600 is completed.
[00177] As discussed above, some embodiments are directed to Bayesian optimization techniques which, when applied to a particular optimization task, can take advantage of information obtained while applying Bayesian optimization techniques to one or more related optimization tasks. These techniques are referred to herein as "multi-task Bayesian optimization techniques". The multi-task optimization techniques described herein can be applied to different types of problems, examples of which are provided below.
[00178] As a non-limiting example, in some embodiments, the multi-task Bayesian optimization techniques described herein can be applied to the task of identifying hyperparameter values for a particular machine learning system and, to that end, can use information previously obtained while performing the related task of identifying hyperparameter values for a related machine learning system. The related machine learning system can be any machine learning system that shares one or more (for example, all) hyperparameters with the particular machine learning system. For example, the particular machine learning system may comprise a first neural network having a first set of hyperparameters, and the related machine learning system may comprise a second neural network (for example, a neural network having a different number of layers from the first neural network, a neural network having a different non-linearity from the first neural network, the first and second neural networks may be the same, etc.) having a second set of hyperparameters, such that the first and second sets of hyperparameters share at least one hyperparameter. Moreover, even if the first and second sets of hyperparameters do not overlap, a joint parameter space can be created in any suitable way. For example, a "default" value for each parameter can be inferred, so that if that parameter is absent for a particular model, the default value can be used. In this way, each neural network can have the same set of hyperparameters, so that any standard kernel can be applied.
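The default-value construction described above can be sketched minimally (the hyperparameter names below are illustrative): any hyperparameter a particular model lacks is filled with an inferred default, so all models live in one shared parameter space.

```python
def unify_hyperparameters(config, defaults):
    """Embed a model's (possibly partial) hyperparameter configuration into a
    shared space: hyperparameters absent from this model take their default
    values, so every model exposes the same keys and a standard kernel can be
    applied over the joint space."""
    unified = dict(defaults)   # start from the inferred defaults
    unified.update(config)     # override with the model's own values
    return unified

defaults = {"learning_rate": 0.01, "num_layers": 2, "dropout": 0.0}
# A network with no dropout hyperparameter still lives in the shared space:
cfg = unify_hyperparameters({"learning_rate": 0.1, "num_layers": 5}, defaults)
```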
[00179] The information previously obtained while identifying hyperparameters of a related machine learning system may comprise results of evaluating the performance of the related machine learning system for one or more sets of hyperparameter values. Such information may indicate how the related machine learning system (for example, the system comprising the second neural network) performed for various hyperparameter values and, as a result, this information can be used to guide the search for hyperparameter values for the particular machine learning system (for example, the system comprising the first neural network).
[00180] It should be noted that the multi-task optimization techniques described herein are not limited to using information previously obtained from a completed optimization task (for example, information obtained from the completed task of identifying hyperparameters for a machine learning system, "completed" in the sense that the hyperparameter values to be used have been identified and the machine learning system has been configured for use with the identified hyperparameter values). In some embodiments, the multi-task optimization techniques described herein can be applied to multiple related optimization tasks that are solved simultaneously. In such embodiments, the multi-task optimization techniques described herein may involve the evaluation of multiple different objective functions, where each objective function corresponds to a respective optimization task. Because the tasks are related, the results of evaluating an objective function corresponding to one task can be used to guide the selection of a point at which to evaluate another objective function corresponding to another related task.

[00181] As a non-limiting example, in some embodiments, the multi-task Bayesian optimization techniques described herein can be applied to the problem of estimating an average value of an objective function that can be expressed as a combination of objective functions, each of which corresponds to a respective one of multiple related tasks. Such a problem arises in various settings including, for example, when identifying hyperparameters of a machine learning system that would optimize the performance of the machine learning system, where the performance of the machine learning system is obtained by applying T-fold cross-validation, which is a technique for estimating the generalization error of machine learning systems.
[00182] In T-fold cross-validation, the data used to train a machine learning system is partitioned into T subsets, called "folds", and the performance measure of the machine learning system is calculated as the average performance of the machine learning system over the T folds. The performance of the machine learning system for a particular fold is obtained by training the machine learning system on the data in all the other folds and evaluating the system's performance on the data in the particular fold. Thus, to evaluate the performance of the machine learning system for a particular set of hyperparameter values, the machine learning system must be trained T times, which is computationally costly for complex machine learning systems and/or large data sets. However, the performance measures associated with each of the T folds are likely to be correlated with one another, so that evaluating the performance of the machine learning system for a particular fold using a set of hyperparameter values can provide information indicating the performance of the machine learning system for another fold using the same set of hyperparameter values. As a result, the performance of the machine learning system may not need to be evaluated on each of the T folds for each set of hyperparameter values.
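The per-fold sub-objectives of T-fold cross-validation described above can be sketched as follows. The `train_and_score` interface is a hypothetical stand-in for training the machine learning system and returning a scalar performance value.

```python
import numpy as np

def t_fold_scores(X, y, T, train_and_score):
    """Compute one performance value per cross-validation fold: fold t is
    scored by training on the data in all other folds and evaluating on the
    data in fold t.  The overall performance measure is the average of the
    returned values; in the multi-task formulation, each fold's score is one
    correlated sub-objective."""
    folds = np.array_split(np.arange(len(X)), T)
    scores = []
    for t in range(T):
        train_idx = np.concatenate([folds[s] for s in range(T) if s != t])
        test_idx = folds[t]
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return np.array(scores)
```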
[00183] Thus, in some embodiments, the multi-task optimization techniques described herein can be applied to the T-fold cross-validation problem by reformulating it as a multi-task optimization problem in which each task corresponds to identifying a set of hyperparameter values to optimize the performance of the machine learning system for a particular cross-validation fold (that is, for a respective subset of the data used to train the machine learning system). The objective function for a task relates the machine learning system's hyperparameter values to the performance of the machine learning system for the cross-validation fold associated with the task (for example, the objective function for the task associated with cross-validation fold t relates the values of the machine learning system's hyperparameters to the performance of the machine learning system computed by training the machine learning system on the data in all folds other than fold t and evaluating the performance of the resulting trained machine learning system on the data in fold t). Thus, it should be noted that the multi-task optimization techniques described herein can be used to maximize a single objective function that can be specified as a function of multiple other objective functions (which may, for example, be called sub-objective functions).
[00184] As another non-limiting example, in some embodiments, the multi-task Bayesian optimization techniques described herein can be applied to the problem of solving multiple related optimization tasks simultaneously, in which the objective function associated with one of the tasks may be cheaper to evaluate than the objective function associated with another task. When two tasks are related, objective function evaluations for one task can reveal information about, and reduce uncertainty around, the location of one or more extreme points of the objective function for the other task. For example, an objective function associated with task "A" of identifying hyperparameter values to optimize the performance of a machine learning system on a large data set (for example, 10 million data points) is more costly to evaluate (for each set of hyperparameter values) than an objective function associated with the related task "B" of identifying hyperparameter values to optimize the performance of a machine learning system on a subset of the data (for example, 10,000 of the 10 million data points). However, since the tasks are related (one task is a coarse version of the other, similar to annealing), objective function evaluations for task "B" can reveal information about which hyperparameter values to try for task "A", thereby reducing the number of computationally costly evaluations of the objective function for task "A".
[00185] As another non-limiting example, in some embodiments, the multi-task Bayesian optimization techniques described herein can be applied to identifying a value of a hyperparameter of a machine learning system that takes on distinct values that are not naturally ordered (a categorical hyperparameter). A non-limiting example of such a hyperparameter for a machine learning system is the type of non-linearity used in a neural network (for example, a hyperbolic tangent non-linearity, a sigmoid non-linearity, etc.). Another non-limiting example of such a hyperparameter for a machine learning system is the type of kernel used in a support vector machine. Yet another non-limiting example of such a hyperparameter is a parameter that selects a training algorithm for a machine learning system from a set of different training algorithms that can be used to train the machine learning system on the same data set. Multi-task optimization techniques can be applied to such problems by generating a set of related tasks having one task for each value of the categorical hyperparameter(s). Each task comprises identifying values for all hyperparameters of a machine learning system, with the values of the one or more categorical hyperparameters fixed, for each task, to one of the possible sets of values (for example, one task may comprise identifying values of the hyperparameters of a neural network using a hyperbolic tangent as the activation function, and another related task may comprise identifying values of the hyperparameters of the neural network using a sigmoid as the activation function).
[00186] It should be noted that the above examples of problems to which the multi-task Bayesian optimization techniques described herein can be applied are illustrative and not limiting, as the multi-task techniques described herein can be applied to any other suitable set of optimization tasks.
[00187] In some embodiments, the multi-task optimization techniques may comprise using a joint probabilistic model to jointly model multiple objective functions, each of the objective functions corresponding to one of the multiple related tasks. As discussed above, multi-task optimization techniques can be applied to any suitable set of related optimization tasks. As a non-limiting example, each task may comprise identifying hyperparameters to optimize the performance of the same machine learning system for a data set associated with the task and used to train the machine learning system given a set of hyperparameter values. As another non-limiting example, one of the multiple related tasks may comprise identifying hyperparameters to optimize the performance of a machine learning system for a first associated data set, and another of the multiple related tasks may comprise identifying hyperparameters to optimize the performance of another, related machine learning system for a second data set (the first data set may be different from or the same as the second data set). In each of these examples, the objective function corresponding to a particular task can relate the machine learning system's hyperparameter values to its performance.
[00188] In some embodiments, the joint probabilistic model of the multiple objective functions can model the correlation between tasks in the plurality of tasks. In some embodiments, the joint probabilistic model may comprise one or more parameters for modeling the correlation between tasks in the plurality of tasks (for example, one or more parameters specifying a correlation or covariance kernel). The values of these parameter(s) can be estimated based on the results of evaluating the objective functions corresponding to the plurality of tasks. The values of the parameter(s) can be updated when one or more additional evaluations of any of the multiple objective functions are performed. Thus, the parameter(s) of the joint probabilistic model that model the correlation between tasks in the plurality of tasks can be estimated adaptively.
[00189] For example, in some embodiments, the joint probabilistic model of the multiple objective functions may comprise a covariance kernel that models the correlation between tasks in the plurality of tasks. In some embodiments, the covariance kernel (K_mult) can be obtained based, at least in part, on a first covariance kernel (K_f) that models the correlation between tasks in the plurality of tasks and a second covariance kernel (K_x) that models the correlation between the points at which objective functions in the plurality of objective functions may be evaluated. The covariance kernel can be calculated from the first and second covariance kernels according to:
\[
K_{\mathrm{mult}}\bigl((x, t), (x', t')\bigr) = K_f(t, t') \otimes K_x(x, x') \qquad (25)
\]
where ⊗ represents the Kronecker product.
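Equation 25 can be illustrated with a small numerical sketch. The Matérn 5/2 kernel is used for K_x here purely as an example; the task kernel K_f and the input points are illustrative values.

```python
import numpy as np

def matern52(X1, X2, lengthscale=1.0):
    """Matern-5/2 kernel K_x over input points (one point per row)."""
    d = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)) / lengthscale
    return (1.0 + np.sqrt(5.0) * d + 5.0 * d ** 2 / 3.0) * np.exp(-np.sqrt(5.0) * d)

def multitask_kernel(K_f, X1, X2, lengthscale=1.0):
    """K_mult = K_f (x) K_x (Equation 25): covariance over (task, input)
    pairs, with K_f a T x T task-correlation matrix and K_x the kernel over
    input points.  np.kron implements the Kronecker product."""
    return np.kron(K_f, matern52(X1, X2, lengthscale))

K_f = np.array([[1.0, 0.8],
                [0.8, 1.0]])      # two positively correlated tasks
X = np.array([[0.0], [1.0]])      # two input points
K = multitask_kernel(K_f, X, X)   # 4 x 4 joint covariance over (task, input)
```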
[00190] In some embodiments, the joint probabilistic model of the multiple objective functions may comprise a vector-valued Gaussian process that can be used to model a multi-task objective function f mapping values in the domain X to the range R^T, where R is the set of real numbers and T is an integer greater than or equal to two. The domain X may be multidimensional. In this way, the multi-task objective function f modeled by a vector-valued Gaussian process maps inputs to T outputs corresponding to the T related tasks, where each of the T outputs is an output for a corresponding task. In some embodiments, the covariance kernel of the Gaussian process can be given by Equation (25), with the kernel K_x specified by any of the kernel functions described herein (for example, a Matérn kernel). It should be noted, however, that the joint probabilistic model of the multiple objective functions is not limited to comprising a Gaussian process and may comprise any other suitable probabilistic model.
[00191] In some embodiments, the kernel K_f can be estimated from the evaluations of the multiple objective functions. Any suitable estimation technique can be used to estimate the kernel K_f. For example, in some embodiments, slice sampling (or any other suitable Monte Carlo technique) can be used to estimate a Cholesky factor of the kernel K_f. In some embodiments, the kernel K_f is estimated subject to the constraint that related tasks are positively correlated. In such embodiments, the elements of K_f can be estimated in log space and exponentiated accordingly so that this constraint is satisfied. It should be noted that any suitable parameterization of a covariance kernel can be used, as aspects of the technology described herein are not limited to any particular parameterization (for example, Cholesky) of the covariance kernel.
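One possible (illustrative, not prescribed) parameterization: store the lower-triangular Cholesky factor with its diagonal in log space, so that every unconstrained parameter vector (for example, one proposed by slice sampling) maps to a valid covariance matrix K_f = L Lᵀ.

```python
import numpy as np

def task_kernel_from_params(params, T):
    """Build K_f = L L^T from an unconstrained parameter vector of length
    T * (T + 1) / 2 holding the lower triangle of L.  Diagonal entries of L
    are exponentiated from log space, so the resulting K_f is always a
    symmetric positive semi-definite covariance matrix."""
    L = np.zeros((T, T))
    L[np.tril_indices(T)] = params
    diag = np.diag_indices(T)
    L[diag] = np.exp(L[diag])   # positivity of the Cholesky diagonal
    return L @ L.T

# Two tasks; parameters ordered as (L[0,0], L[1,0], L[1,1]) in log/raw space.
K_f = task_kernel_from_params([0.0, 0.5, 0.0], T=2)
```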
[00192] Figure 7 is a flowchart of an illustrative process 700 for performing multi-task Bayesian optimization using a set of objective functions, in which each of the objective functions in the set is associated with a respective task in a set of related tasks. The set of objective functions can comprise any suitable number of functions (for example, two, three, five, at least two, at least five, at least ten, at least 25, at least 50, between 2 and 25, between 10 and 100, etc.). Process 700 can be used to identify an extreme point (for example, a local minimum, local maximum, global minimum, global maximum, etc.) of one or more of the objective functions using the techniques described herein.
[00193] Process 700 can be performed using one or more computer hardware processors, as the aspects of the technology described herein are not limited in this respect. When process 700 is performed using multiple computer hardware processors, its execution can be parallelized across the multiple processors according to the techniques described above in relation to Figure 6.
[00194] In some embodiments, process 700 can be applied to identify (for example, locate or approximate the locations of) one or more extreme points of one or more objective functions that relate values of hyperparameters of a machine learning system to respective values providing a measure of the machine learning system's performance. Process 700 can be used to set hyperparameter values for any of the machine learning systems described herein and/or any other suitable machine learning systems. Additionally or alternatively, process 700 may be applied to identify (for example, locate or approximate the locations of) one or more extreme points of one or more objective functions that arise in any other related optimization tasks.
[00195] Process 700 starts at act 702, in which a joint probabilistic model of the objective functions in the set of objective functions is initialized. The joint probabilistic model can be any suitable probabilistic model. As a non-limiting example, in some embodiments, the joint probabilistic model can comprise a vector-valued Gaussian process specified using a covariance kernel given by Equation (25). However, in other embodiments, the Gaussian process can be specified using any other suitable kernel, and in still other embodiments, the joint probabilistic model may not comprise a Gaussian process and may instead comprise a neural network, an adaptive basis function regression model (with basis functions having multiple outputs), or any other suitable probabilistic model. The joint probabilistic model can be initialized in any suitable way (for example, as described in relation to act 402 of process 400), as the aspects of the technology described herein are not limited by the way in which the joint probabilistic model of the multiple objective functions is initialized.
[00196] Next, process 700 proceeds to act 704, in which a point is identified at which to evaluate an objective function in the set of objective functions. The point can be identified based, at least in part, on the joint probabilistic model of the objective functions and an acquisition utility function (which may depend on the joint probabilistic model of the objective functions). Any of the various types of acquisition utility functions can be used after being suitably generalized to the multi-task setting. As a non-limiting example, in some embodiments, the entropy search acquisition function (see, for example, Equation 9) can be generalized to the multi-task case, and the point at which to evaluate an objective function in the set of objective functions can be identified based on the joint probabilistic model and the generalized entropy search acquisition function.
[00197] In some embodiments, the entropy search acquisition function can be generalized to take into account the computational cost of evaluating objective functions in the set of objective functions. The resulting acquisition function a_IG(x), referred to as the cost-weighted entropy search acquisition utility function, can be computed according to:
\[
a_{IG}(x) = \frac{1}{c_t(x)} \iint \bigl[ H(P_{\min}) - H\bigl(P_{\min}^{\,y}\bigr) \bigr]\, p(y \mid f)\, p\bigl(f \mid x, \theta, \{x_n^t, y_n^t;\ 1 \le n \le N\}\bigr)\, dy\, df \qquad (26)
\]
where p() is the joint probabilistic model of the objective functions in the set of objective functions; P_min^y indicates that the imagined observation {x, y} has been added to the set of observations, where x is a point at which the objective function associated with the t-th task may be evaluated and y is the value of the objective function at x imagined when evaluating the entropy search acquisition function; p(f | x) represents p(f | x, θ, {x_n^t, y_n^t; 1 ≤ n ≤ N}); H(P) represents the entropy of P; P_min represents Pr(minimum at x | θ, X, {x_n^t, y_n^t; 1 ≤ n ≤ N}); and each x_n^t corresponds to a point at which the objective function associated with the t-th task was evaluated to obtain the evaluation result y_n^t. The function c_t(x) represents the cost of evaluating the objective function associated with the t-th task at point x. This cost function may be known in advance or, in some embodiments, may be estimated based on one or more evaluations of the objective functions in the set of objective functions (together with information indicating how long each evaluation took to complete). The cost-weighted entropy search acquisition function may reflect the information gain (from evaluating the t-th objective function at point x) per unit cost of evaluating a candidate point.
[00198] The point at which to evaluate an objective function in the set of objective functions can be identified as the point (or as an approximation of the point) at which the acquisition utility function (for example, the cost-weighted entropy search acquisition utility function) attains its maximum value. In some embodiments, the point at which the acquisition utility function attains its maximum can be identified exactly (for example, when the acquisition utility function is available in closed form). In some embodiments, however, the point at which the acquisition utility function attains its maximum value may not be identifiable exactly (for example, because the acquisition utility function is not available in closed form), in which case the point at which the acquisition utility function attains its maximum value may be identified or approximated using numerical techniques. For example, the cost-weighted entropy search acquisition utility function may not be available in closed form, and Monte Carlo techniques (for example, rejection sampling, importance sampling, Markov chain Monte Carlo, etc.) can be used to identify or approximate the point at which the acquisition utility function attains its maximum value.

[00199] It should be noted that the point at which to evaluate an objective function in the set of objective functions is not limited to being the point (or an approximation to the point) at which the acquisition utility function attains its maximum, and may be any other suitable point identified using the acquisition utility function (for example, a local maximum of the acquisition utility function, a local or global minimum of the acquisition utility function, etc.).
[00200] After the point at which to evaluate an objective function in the set of objective functions is identified in act 704, process 700 proceeds to act 706, in which an objective function is selected from the set of objective functions to be evaluated at the point identified in act 704. The objective function to be evaluated at the identified point can be selected based, at least in part, on the joint probabilistic model. As a non-limiting example, the objective function to be evaluated may be selected as the objective function that, according to the joint probabilistic model, has the highest probability of generating the best corresponding value at the identified point.
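One such selection rule can be sketched as follows (assuming minimization; the rule and interface are illustrative, not prescribed by the text): sample each task's predictive distribution at the identified point and pick the task most likely to produce the best value there.

```python
import numpy as np

def select_task(mu, sigma, n_samples=4096, seed=0):
    """Given per-task predictive means mu and standard deviations sigma
    (each of length T) at the identified point, return the index of the task
    whose objective is most likely, under the joint model's predictive
    marginals, to attain the minimum value at that point."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(mu, sigma, size=(n_samples, len(mu)))  # joint fantasies
    counts = np.bincount(draws.argmin(axis=1), minlength=len(mu))
    return int(counts.argmax())
```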
[00201] Next, process 700 proceeds to act 708, in which the objective function selected in act 706 is evaluated at the point identified in act 704. Process 700 then proceeds to act 710, in which the joint probabilistic model can be updated based on the results of the evaluation performed in act 708 to obtain an updated joint probabilistic model.
[00202] The joint probabilistic model can be updated in any of several ways based on the results of the new evaluation obtained in act 708. For example, updating the joint probabilistic model may include updating (for example, re-estimating) one or more parameters of the probabilistic model based on the results of the evaluation performed in act 708. As a non-limiting example, updating the joint probabilistic model can include updating one or more parameters in the joint probabilistic model used to model the correlation between tasks in the plurality of tasks (for example, one or more parameters specifying a correlation or covariance kernel). As another non-limiting example, updating the joint probabilistic model may comprise updating one or more parameters of the acquisition utility function (for example, the cost function c_t(x) of the cost-weighted entropy search acquisition function). Additionally or alternatively, the joint probabilistic model can be updated in any of the ways described in relation to act 408 of process 400 and/or in any other suitable way.
[00203] After the joint probabilistic model is updated in act 710, process 700 proceeds to decision block 712, in which it is determined whether any of the objective functions in the set of objective functions is to be evaluated at another point. This determination can be made in any suitable way. For example, this determination can be made for each of the objective functions in the set of objective functions in any of the ways described in relation to decision block 410 of process 400, and if it is determined that any of the objective functions is to be evaluated again, process 700 returns to act 704 via the "YES" branch, and acts 704 through 710 and decision block 712 are repeated.
[00204] On the other hand, if it is determined that none of the objective functions in the set of objective functions is to be evaluated again, process 700 proceeds via the "NO" branch to act 714, in which an extreme value of one or more of the objective functions in the set of objective functions can be identified. The extreme value of an objective function in the set of objective functions can be found in any suitable way and, for example, can be found in any of the ways described in relation to act 412 of process 400. After the extreme value of one or more of the objective functions is identified in act 714, process 700 is completed.
[00205] It should be noted that process 700 is illustrative and that various variations of process 700 are possible. For example, although in the illustrated embodiment a point at which to evaluate an objective function is identified first in act 704 and an objective function to be evaluated at the identified point is selected second in act 706, in other embodiments the order of the two steps can be reversed. Thus, in some embodiments, a task whose objective function is to be evaluated may be selected first, and a point at which to evaluate the objective function associated with the selected task may be identified second.
[00206] As another example, the joint probabilistic model of the objective functions may be specified using one or more non-linear mappings (for example, each task may be associated with a respective non-linear mapping), which can be useful in a variety of problems. For example, when training a machine learning system on different data sets, the size of the data set can have an effect on which hyperparameter settings will result in good performance of the machine learning system. For example, a machine learning system trained using a small data set may require more regularization than a case where the same machine learning system is trained on a larger data set (for example, so that hyperparameters indicating an amount of smoothing may be different for machine learning systems trained on small and large amounts of data). More generally, it is possible for one part of the input space for one task to be correlated with a different part of the input space for another task. Allowing each task to be associated with its own respective non-linear warping (for example, as described above for a single task) may allow the joint probabilistic model to account for such inter-task correlation. Inferring the parameters associated with the non-linear warpings (for example, parameters of associated cumulative distribution functions, etc.) can warp the tasks into a stationary space in which they are more appropriately modeled jointly by a stationary multi-task model (for example, a multi-task model specified using a stationary vector-valued Gaussian process).
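One way such per-task non-linear warpings can be combined with an inter-task covariance is sketched below. This is an illustrative intrinsic-coregionalization-style construction, not necessarily the modeling choice of the embodiments: each task's inputs pass through their own monotone Kumaraswamy CDF (used here as a simple stand-in for a Beta CDF warping, since it needs only NumPy), and a positive semi-definite task-covariance matrix `B` scales a shared stationary kernel. All parameter names are assumptions.

```python
import numpy as np

def kumaraswamy_cdf(x, a, b):
    """Monotone warping of [0, 1]; a and b control the distortion and
    would be inferred along with the other model parameters."""
    return 1.0 - (1.0 - np.clip(x, 0.0, 1.0) ** a) ** b

def multitask_cov(x1, t1, x2, t2, B, warp, length_scale=0.3):
    """Covariance between point x1 on task t1 and point x2 on task t2:
    the task-covariance entry B[t1, t2] times a stationary kernel
    applied to the per-task warped inputs, so that different regions
    of the two tasks' input spaces can be made to line up."""
    w1 = kumaraswamy_cdf(x1, *warp[t1])
    w2 = kumaraswamy_cdf(x2, *warp[t2])
    return B[t1, t2] * np.exp(-0.5 * (w1 - w2) ** 2 / length_scale ** 2)
```

With the identity warp `(a, b) = (1, 1)` for every task, this reduces to an ordinary stationary multi-task kernel; task-specific `(a, b)` values let the stationary part operate in a shared warped space.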
[00207] An illustrative implementation of a computer system 800 that may be used in connection with any of the embodiments of the technology described herein is shown in Figure 8. The computer system 800 may include one or more processors 810 and one or more articles of manufacture comprising non-transitory computer-readable storage media (for example, memory 820 and one or more non-volatile storage media 830). The processor 810 may control writing data to and reading data from the memory 820 and the non-volatile storage media 830 in any suitable manner, as the aspects of the technology described herein are not limited in this respect. To perform any of the functionality described herein, the processor 810 may execute one or more processor-executable instructions stored in one or more non-transitory computer-readable storage media (for example, the memory 820), which may serve as non-transitory computer-readable storage media storing processor-executable instructions for execution by the processor 810.
[00208] The terms "program" or "software" are used herein in a generic sense to refer to any type of computer code or set of processor-executable instructions that can be employed to program a computer or other processor to implement various aspects of the embodiments discussed above. Additionally, it should be appreciated that, according to one aspect, one or more computer programs that when executed perform the methods of the technology described herein need not reside on a single computer or processor, but may be distributed in a modular fashion among different computers or processors to implement various aspects of the technology described herein.
[00209] Processor-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
[00210] Also, data structures may be stored in one or more non-transitory computer-readable storage media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields to locations in a non-transitory computer-readable medium that convey the relationship between the fields. However, any suitable mechanism may be used to establish relationships among information in fields of a data structure, including through the use of pointers, tags, or other mechanisms that establish relationships among data elements.
[00211] Also, various inventive concepts may be embodied as one or more processes, of which examples (Figures 4, 6 and 7) have been provided. The actions performed as part of each process may be ordered in any suitable way. Accordingly, embodiments may be constructed in which actions are performed in an order different from that illustrated, which may include performing some actions simultaneously, even though they are shown as sequential actions in illustrative embodiments.
[00212] Use of ordinal terms such as "first", "second", "third", etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another, or the temporal order in which the actions of a method are performed. Such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for use of the ordinal term).
[00213] The phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use of "including", "comprising", "having", "containing", "involving", and variations thereof, is meant to encompass the items listed thereafter and additional items.
Claims:
Claims (10)
[1]
1. System for use in connection with performing optimization using a plurality of objective functions associated with a respective plurality of tasks, characterized by the fact that it comprises:
at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
identifying, based at least in part on a joint probabilistic model of the plurality of objective functions, a first point at which to evaluate an objective function in the plurality of objective functions;
selecting, based at least in part on the joint probabilistic model, a first objective function in the plurality of objective functions to evaluate at the identified first point;
evaluating the first objective function at the identified first point; and updating the joint probabilistic model based on results of the evaluation to obtain an updated joint probabilistic model.
[2]
2. System, according to claim 1, characterized by the fact that the first objective function relates values of hyperparameters of a machine learning system to values providing a measure of performance of the machine learning system.
[3]
3. System, according to claim 1 or 2, characterized by the fact that the first objective function relates values of a plurality of hyperparameters of a neural network for identifying objects in images to respective values providing a measure of performance of the neural network in identifying objects in images.
Petition 870170035044, of 05/25/2017, p. 5/11
[4]
4. System, according to any one of claims 1 to 3, characterized in that the processor-executable instructions further cause the at least one computer hardware processor to perform:
identifying, based at least in part on the updated joint probabilistic model of the plurality of objective functions, a second point at which to evaluate an objective function in the plurality of objective functions;
selecting, based at least in part on the joint probabilistic model, a second objective function in the plurality of objective functions to evaluate at the identified second point; and evaluating the second objective function at the identified second point;
wherein the first objective function is different from the second objective function.
[5]
5. System, according to any one of claims 1 to 4, characterized by the fact that the joint probabilistic model of the plurality of objective functions comprises a vector-valued Gaussian process.
[6]
6. System, according to any one of claims 1 to 5, characterized by the fact that the identifying is performed further based on a cost-weighted entropy-search utility function.
[7]
7. Method for use in connection with performing optimization using a plurality of objective functions associated with a respective plurality of tasks, characterized by the fact that it comprises the step of:
using at least one computer hardware processor to perform:
identifying, based at least in part on a joint probabilistic model of the plurality of objective functions, a first point at which to evaluate an objective function in the plurality of objective functions;
selecting, based at least in part on the joint probabilistic model, a first objective function in the plurality of objective functions to evaluate at the identified first point;
evaluating the first objective function at the identified first point; and updating the joint probabilistic model based on results of the evaluation to obtain an updated joint probabilistic model.
[8]
8. Method, according to claim 7, characterized by the fact that the first objective function relates values of hyperparameters of a machine learning system to values providing a measure of performance of the machine learning system.
[9]
9. Method, according to claim 7 or 8, characterized by the fact that the joint probabilistic model of the plurality of objective functions comprises a vector-valued Gaussian process.
[10]
10. Non-transitory computer-readable storage medium characterized by the fact that it stores processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method as defined in any one of claims 7 to 9, for use in connection with performing optimization using a plurality of objective functions associated with a respective plurality of tasks.
Similar technologies:
Publication number | Publication date | Patent title
BR112015029806A2|2020-04-28|systems and methods for performing Bayesian optimization
Vannieuwenhoven et al.2012|A new truncation strategy for the higher-order singular value decomposition
Deleforge et al.2015|High-dimensional regression with gaussian mixtures and partially-latent response variables
Geppert et al.2017|Random projections for Bayesian regression
Moores et al.2015|Pre-processing for approximate Bayesian computation in image analysis
US20180349158A1|2018-12-06|Bayesian optimization techniques and applications
Byrne2009|Block‐iterative algorithms
Li et al.2016|Approximating cross-validatory predictive evaluation in Bayesian latent variable models with integrated IS and WAIC
Burger2016|Bregman distances in inverse problems and partial differential equations
Lee et al.2016|Streamlined mean field variational Bayes for longitudinal and multilevel data analysis
Li et al.2019|A novel deep neural network method for electrical impedance tomography
Tibbits et al.2014|Automated factor slice sampling
Papastamoulis2014|Handling the label switching problem in latent class models via the ECR algorithm
Langone et al.2017|Fast kernel spectral clustering
Bardsley et al.2012|An MCMC method for uncertainty quantification in nonnegativity constrained inverse problems
De Wiljes et al.2013|An adaptive Markov chain Monte Carlo approach to time series clustering of processes with regime transition behavior
Mandal et al.2016|l1 regularized multiplicative iterative path algorithm for non-negative generalized linear models
Minervini et al.2015|Large-scale analysis of neuroimaging data on commercial clouds with content-aware resource allocation strategies
Audelan et al.2020|Robust fusion of probability maps
Zaeemzadeh et al.2018|A Bayesian approach for asynchronous parallel sparse recovery
Chatrabgoun et al.2016|Copula density estimation using multiwavelets based on the multiresolution analysis
Scheinberg et al.2010|Sparse Markov net learning with priors on regularization parameters.
Akrami et al.2021|Quantile Regression for Uncertainty Estimation in VAEs with Applications to Brain Lesion Detection
Maisog2009|Non-negative matrix factorization: Assessing methods for evaluating the number of components and the effect of normalization thereon
Li et al.2020|Regularized Optimal Transport
Patent family:
Publication number | Publication date
CA2913743A1|2014-12-04|
WO2014194161A3|2015-01-29|
JP2016523402A|2016-08-08|
US10346757B2|2019-07-09|
US10074054B2|2018-09-11|
KR20160041856A|2016-04-18|
US20140358831A1|2014-12-04|
US20160328655A1|2016-11-10|
EP3000053A4|2017-10-04|
EP3000053A2|2016-03-30|
KR102219346B1|2021-02-23|
US20200027012A1|2020-01-23|
KR20210021147A|2021-02-24|
JP6483667B2|2019-03-13|
US20160292129A1|2016-10-06|
US20160328653A1|2016-11-10|
HK1223430A1|2017-07-28|
US9858529B2|2018-01-02|
WO2014194161A2|2014-12-04|
US9864953B2|2018-01-09|
Cited references:
Publication number | Filing date | Publication date | Applicant | Patent title

JPH06149866A|1992-11-09|1994-05-31|Ricoh Co Ltd|Solution searching device|
US6735596B2|2001-06-07|2004-05-11|Guy Charles Corynen|Computer method and user interface for decision analysis and for global system optimization|
US20060200333A1|2003-04-10|2006-09-07|Mukesh Dalal|Optimizing active decision making using simulated decision making|
EP1598751B1|2004-01-12|2014-06-25|Honda Research Institute Europe GmbH|Estimation of distribution algorithm |
US7509259B2|2004-12-21|2009-03-24|Motorola, Inc.|Method of refining statistical pattern recognition models and statistical pattern recognizers|
US8301390B2|2007-01-31|2012-10-30|The Board Of Trustees Of The University Of Illinois|Quantum chemistry simulations using optimization methods|
US8315960B2|2008-11-11|2012-11-20|Nec Laboratories America, Inc.|Experience transfer for the configuration tuning of large scale computing systems|
US8811726B2|2011-06-02|2014-08-19|Kriegman-Belhumeur Vision Technologies, Llc|Method and system for localizing parts of an object in an image for computer vision applications|
US8924315B2|2011-12-13|2014-12-30|Xerox Corporation|Multi-task learning using bayesian model with enforced sparsity and leveraging of task correlations|
US9858529B2|2013-05-30|2018-01-02|President And Fellows Of Harvard College|Systems and methods for multi-task Bayesian optimization|US9672193B2|2013-03-15|2017-06-06|Sas Institute Inc.|Compact representation of multivariate posterior probability distribution from simulated samples|
US9858529B2|2013-05-30|2018-01-02|President And Fellows Of Harvard College|Systems and methods for multi-task Bayesian optimization|
WO2015013283A2|2013-07-22|2015-01-29|Texas State University|Autonomous performance optimization in robotic assembly process|
US9390712B2|2014-03-24|2016-07-12|Microsoft Technology Licensing, Llc.|Mixed speech recognition|
CA2896052A1|2014-07-04|2016-01-04|Tata Consultancy Services Limited|System and method for prescriptive analytics|
US10120962B2|2014-09-02|2018-11-06|International Business Machines Corporation|Posterior estimation of variables in water distribution networks|
US10275719B2|2015-01-29|2019-04-30|Qualcomm Incorporated|Hyper-parameter selection for deep convolutional networks|
JP6388074B2|2015-03-26|2018-09-12|日本電気株式会社|Optimization processing apparatus, optimization processing method, and program|
CN106156807B|2015-04-02|2020-06-02|华中科技大学|Training method and device of convolutional neural network model|
US9734436B2|2015-06-05|2017-08-15|At&T Intellectual Property I, L.P.|Hash codes for images|
US10755810B2|2015-08-14|2020-08-25|Elucid Bioimaging Inc.|Methods and systems for representing, storing, and accessing computable medical imaging-derived quantities|
US11150921B2|2015-09-01|2021-10-19|International Business Machines Corporation|Data visualizations selection|
CN106570513B|2015-10-13|2019-09-13|华为技术有限公司|The method for diagnosing faults and device of big data network system|
CH711716A1|2015-10-29|2017-05-15|Supsi|Learning the structure of Bayesian networks from a complete data set|
JP6470165B2|2015-12-15|2019-02-13|株式会社東芝|Server, system, and search method|
US11062229B1|2016-02-18|2021-07-13|Deepmind Technologies Limited|Training latent variable machine learning models using multi-sample objectives|
CN105590623B|2016-02-24|2019-07-30|百度在线网络技术(北京)有限公司|Letter phoneme transformation model generation method and device based on artificial intelligence|
US10235443B2|2016-03-01|2019-03-19|Accenture Global Solutions Limited|Parameter set determination for clustering of datasets|
JP6703264B2|2016-06-22|2020-06-03|富士通株式会社|Machine learning management program, machine learning management method, and machine learning management device|
US10789538B2|2016-06-23|2020-09-29|International Business Machines Corporation|Cognitive machine learning classifier generation|
US10789546B2|2016-06-23|2020-09-29|International Business Machines Corporation|Cognitive machine learning classifier generation|
US10579729B2|2016-10-18|2020-03-03|International Business Machines Corporation|Methods and system for fast, adaptive correction of misspells|
US10372814B2|2016-10-18|2019-08-06|International Business Machines Corporation|Methods and system for fast, adaptive correction of misspells|
WO2018089451A1|2016-11-09|2018-05-17|Gamalon, Inc.|Machine learning data analysis system and method|
RU2641447C1|2016-12-27|2018-01-17|Общество с ограниченной ответственностью "ВижнЛабс"|Method of training deep neural networks based on distributions of pairwise similarity measures|
US10740880B2|2017-01-18|2020-08-11|Elucid Bioimaging Inc.|Systems and methods for analyzing pathologies utilizing quantitative imaging|
US20180239851A1|2017-02-21|2018-08-23|Asml Netherlands B.V.|Apparatus and method for inferring parameters of a model of a measurement structure for a patterning process|
US20180349158A1|2017-03-22|2018-12-06|Kevin Swersky|Bayesian optimization techniques and applications|
US20200259842A1|2017-09-25|2020-08-13|Sony Corporation|Verification apparatus, information processing method, and program|
US10282237B1|2017-10-30|2019-05-07|SigOpt, Inc.|Systems and methods for implementing an intelligent application program interface for an intelligent optimization platform|
KR102107378B1|2017-10-31|2020-05-07|삼성에스디에스 주식회사|Method For optimizing hyper-parameter automatically and Apparatus thereof|
US11270217B2|2017-11-17|2022-03-08|Intel Corporation|Systems and methods implementing an intelligent machine learning tuning system providing multiple tuned hyperparameter solutions|
JP2019111604A|2017-12-22|2019-07-11|セイコーエプソン株式会社|Control device, robot and robot system|
JP6856557B2|2018-01-22|2021-04-07|株式会社日立製作所|Optimization device and hyperparameter optimization method|
JP2019192608A|2018-04-27|2019-10-31|国立研究開発法人物質・材料研究機構|Structure with narrow band thermal emission spectrum|
US10565085B2|2018-06-06|2020-02-18|Sas Institute, Inc.|Two-stage distributed estimation system|
KR102173243B1|2018-06-14|2020-11-03|밸류파인더스|Methode for Performance Improvement of Portfolio Asset Allocation Using Recurrent Reinforcement Learning|
KR102063791B1|2018-07-05|2020-01-08|국민대학교산학협력단|Cloud-based ai computing service method and apparatus|
CN109242959B|2018-08-29|2020-07-21|清华大学|Three-dimensional scene reconstruction method and system|
EP3620996A1|2018-09-04|2020-03-11|Siemens Aktiengesellschaft|Transfer learning of a machine-learning model using a hyperparameter response model|
US20200143231A1|2018-11-02|2020-05-07|Microsoft Technology Licensing, Llc|Probabilistic neural network architecture generation|
EP3887913A2|2018-11-28|2021-10-06|Tata Consultancy Services Ltd.|System and method for operation optimization of an equipment|
KR102105787B1|2019-01-28|2020-04-29|한국과학기술원|Apparatus and method for controlling camera attribute using bayesian optimization|
JP2020144530A|2019-03-05|2020-09-10|日本電信電話株式会社|Parameter estimation device, method, and program|
US11157812B2|2019-04-15|2021-10-26|Intel Corporation|Systems and methods for tuning hyperparameters of a model and advanced curtailment of a training of the model|
JP2020181318A|2019-04-24|2020-11-05|日本電信電話株式会社|Optimization device, optimization method, and program|
JPWO2020235104A1|2019-05-23|2020-11-26|
CN110263938B|2019-06-19|2021-07-23|北京百度网讯科技有限公司|Method and apparatus for generating information|
US10984507B2|2019-07-17|2021-04-20|Harris Geospatial Solutions, Inc.|Image processing system including training model based upon iterative blurring of geospatial images and related methods|
WO2021007812A1|2019-07-17|2021-01-21|深圳大学|Deep neural network hyperparameter optimization method, electronic device and storage medium|
US11068748B2|2019-07-17|2021-07-20|Harris Geospatial Solutions, Inc.|Image processing system including training model based upon iteratively biased loss function and related methods|
KR20210026623A|2019-08-30|2021-03-10|삼성전자주식회사|System and method for training artificial intelligence model|
EP3786857A1|2019-09-02|2021-03-03|Secondmind Limited|Computational implementation of gaussian process models|
US11003825B1|2019-09-26|2021-05-11|Cadence Design Systems, Inc.|System, method, and computer program product for optimization in an electronic design|
WO2021067358A1|2019-10-01|2021-04-08|Ohio State Innovation Foundation|Optimizing reservoir computers for hardware implementation|
WO2021066504A1|2019-10-02|2021-04-08|한국전자통신연구원|Deep neutral network structure learning and simplifying method|
CN111027709B|2019-11-29|2021-02-12|腾讯科技(深圳)有限公司|Information recommendation method and device, server and storage medium|
CN111783293A|2020-06-24|2020-10-16|西北工业大学|Method for analyzing post-buckling reliability of composite material stiffened wall panel based on self-adaptive important sampling|
KR102314847B1|2021-03-30|2021-10-19|주식회사 솔리드웨어|Optimal model seeking method and apparatus|
Legal status:
2018-11-06| B06F| Objections, documents and/or translations needed after an examination request according [chapter 6.6 patent gazette]|
2020-04-22| B06U| Preliminary requirement: requests with searches performed by other patent offices: procedure suspended [chapter 6.21 patent gazette]|
2021-07-20| B350| Update of information on the portal [chapter 15.35 patent gazette]|
2021-10-19| B350| Update of information on the portal [chapter 15.35 patent gazette]|
2021-12-07| B07A| Application suspended after technical examination (opinion) [chapter 7.1 patent gazette]|
Priority:
Application number | Filing date | Patent title
US201361829090P| true| 2013-05-30|2013-05-30|
US201361829604P| true| 2013-05-31|2013-05-31|
US201361910837P| true| 2013-12-02|2013-12-02|
PCT/US2014/040141|WO2014194161A2|2013-05-30|2014-05-30|Systems and methods for performing bayesian optimization|